PATENT 
20002/16136 

IN THE UNITED STATES PATENT AND TRADEMARK OFFICE 

Applicant: ) 
Mingqiu Sun ) 

Serial No.: 10/424,356 ) 

For: METHODS AND )
APPARATUS TO DETECT )
PATTERNS IN PROGRAMS )

Filed: April 28, 2003 )

Group Art Unit: 2192 ) 

Examiner: John J. Romano ) 

DECLARATION OF JAMES A. FLIGHT 

Mail Stop Amendment 
Commissioner for Patents 
P.O. Box 1450 
Alexandria, VA 22313-1450 

I, James A. Flight, hereby declare and state: 

1. I am attorney of record for the above-referenced patent application.

2. The United States Patent & Trademark Office issued an Office 
action dated March 21, 2006, purporting to reject all pending claims of the 
above-referenced patent application based on Sherwood et al., Phase 
Tracking and Prediction, technical report CS2002-0710, UC San Diego, 
June 2002. That technical report can be found on the Internet at 
http://www.cse.ucsd.edu/Dienst/UI/2.0/Describe/ncstrl.ucsd_cse/CS2002-0710.
A copy of the technical report is attached hereto as Exhibit 1.



U.S. Serial No. 10/424,356
Rule 131 Declaration 

3. As shown in Exhibit 1, the technical report includes two links. 
Both of those links provide the ability to download a 10Kb postscript file. 
A copy of that postscript file is attached hereto as Exhibit 2. 

4. Exhibit 2 states "To obtain a copy of this techreport, please look
for it at the following site:

http://www-cse.ucsd.edu/users/calder/papers.html
Or send email or letter to:

Brad Calder... calder@cs.ucsd.edu". Following the link in Exhibit 2 leads
to Exhibit 3.

5. Exhibit 3 does not list the technical report as published in 2002.
However, it identifies a corresponding paper as published in June of 2003, 
namely, Timothy Sherwood, Suleyman Sair, and Brad Calder, Phase 
Tracking and Prediction. 30th International Symposium on Computer 
Architecture, June 2003. The PDF of this June of 2003 paper is linked in 
Exhibit 3. A copy of this June of 2003 paper is attached hereto as Exhibit 
4. Exhibit 4 identifies its date of publication as June of 2003. 

6. The Office action of March 21, 2006 purports to reject the claims 
noted in paragraph 2 above based on the technical report of 2002. 
However, it actually relies on the content of the June of 2003 publication 
(i.e., Exhibit 4) to support the rejections. As noted above, the technical 
report of 2002 (Exhibit 1) includes only an abstract. It does not include 
Exhibit 4. 

7. Because of the evident inconsistency between Exhibit 1 and 
Exhibit 4, I sent an email to Timothy Sherwood, the first listed author in
Exhibits 1 and 4 on October 30, 2006 asking Mr. Sherwood to identify the 
correct date of publication for the full article. As shown in Exhibit 5, Mr. 
Sherwood replied to my email on October 30, 2006 by stating: 
Hi Jim, 

The publication appeared in ISCA 2003 and so was officially 
published on June 9th, 2003. 

http://cs.nyu.edu/isca03/ 

-Tim 

8. The citation in Mr. Sherwood's email (Exhibit 5) is to the program 
announcement shown in Exhibit 6. That program announcement states 
"The thirtieth International Symposium on Computer Architecture (ISCA) 
will be held at the Town and Country Hotel in San Diego 9-11 June,
2003." Thus, Mr. Sherwood, the first named author of Exhibits 1 and 4,
indicated that the full publication was made available in June of 2003, 
which is after the April 28, 2003 filing date of the instant application. 

9. Having no reason whatsoever to doubt Mr. Sherwood's testimony, 
I prepared and filed a response to the Office action dated March 21, 2006 
attaching a copy of Mr. Sherwood's email and explaining that, based on 
the author's testimony, the full Sherwood article (i.e., Exhibit 4) is not 
prior art to the instant application. 

10. The USPTO issued a final Office action on May 16, 2007 refusing 
to accept the author's testimony because it was not in the form of an 
affidavit or declaration. Accordingly, I am submitting this requested 
declaration to verify the source of the emails noted above. 

11. The final Office action attempts to locate evidence that the author,
Mr. Sherwood, is incorrect in his belief that the publication of Exhibit 4
did not occur until after the filing date of the instant application. In 
particular, the final Office action cites four pieces of evidence to allegedly 
support an earlier publication date for the full article. 

12. First, the final Office action cites the publication "ACM SIGARCH 
Computer Architecture News, archive volume 31, Issue 2 (May 2003), 
pages 336-349, ISSN:0163-5964" (attached hereto as Exhibit 7) as
evidencing publication prior to June of 2003. Exhibit 7 does indeed note 
that the Sherwood article was published in May of 2003. However, May 
of 2003 is still after the April 28, 2003 filing date of this application. 
Therefore, Exhibit 7 fails to make the full Sherwood article (Exhibit 4) 
prior art to this application. A copy of the Sherwood article as linked to in 
Exhibit 7 is attached hereto as Exhibit 8. 

13. The final Office action also cites to Exhibits 9 and 10 (reports by 
year and author, respectively) which identify the date of the technical 
report (i.e., Exhibit 1) as June 23, 2002. However, as noted above, the
technical report is not the full article (i.e., it is not Exhibits 4 or 8) and, 
thus, Exhibits 9 and 10 only serve to document Exhibit 1 as prior art, not 
the full article (i.e., not Exhibits 4 or 8). 

14. Finally, the Office action cites to footnote 20 of "Automatically
Characterizing Large Scale Program Behavior," 2002, ISBN 1-58113-574-2
(Exhibit 11) as evidence of the availability of the technical report (i.e.,
Exhibit 1). Again, however, Exhibit 1 is not Exhibits 4 or 8 and, thus, the
demonstration of the 2002 publication of the technical report does not 
make Exhibits 4 or 8 prior art. Thus, the evidence of record relied upon by 
the final Office action does not demonstrate Exhibits 4 and/or 8 to be prior
art to the instant application. Accordingly, the final Office action's 
reliance on the content of Exhibits 4 and/or 8 to reject the claims is 
improper. 

15. To attempt to determine the earliest publication date of the full 
article (i.e., Exhibits 4 and/or 8), the undersigned searched the Internet 
Archive Wayback Machine (http://www.archive.org/index.php) for the 
technical report (i.e., Exhibit 1). As shown in Exhibit 12, that search
demonstrated that only an abstract of the Sherwood article (i.e., only
Exhibit 1) was present on the Internet as of November 19, 2002. In
particular, as shown in Exhibit 13 attached hereto (i.e., the printout from
the November 19, 2002 link of Exhibit 12), only a single paragraph and
not the full Sherwood article (i.e., not Exhibits 4 and/or 8) could be
accessed as of November 19, 2002. Exhibit 14, which is the printout
from the July 9, 2003 link of Exhibit 12, is identical to the November 19, 
2002 information. Therefore, only Exhibit 1 has been shown to be prior 
art. There is no evidence of record that the full Sherwood article (i.e., 
Exhibits 4 and/or 8) is prior art to this application. 

16. As noted above, Exhibit 2 invited the public to contact Brad Calder 
"to obtain a copy of the techreport." Therefore, there is a possibility that 
Mr. Calder was providing something beyond the content of Exhibit 1 to 
requestors prior to the filing date of this application. Accordingly, to 
determine if anything more than Exhibit 1 is prior art to the instant 
application, I sent the email attached hereto as Exhibit 15 to Mr. Calder on 
July 7, 2007 asking Mr. Calder to clarify the situation. Having had no 
response, I again sent the email attached hereto as Exhibit 16 to Mr.
Calder. As shown at Exhibit 17, Exhibit 16 was delivered to Mr. Calder. 
To date, he has made no response. 



17. I also sent a request for clarification to the webmaster for the server
that serves Exhibit 1 to the Internet asking for clarification as to the
situation and to obtain the postscript file attached hereto as Exhibit 2. As
shown in Exhibit 18, the webmaster responded by re-booting the server to
make Exhibit 2 available, and by referring to Exhibits 9 and 10. As
discussed above, none of this shows the full article (i.e., Exhibits 4 and/or
8) to be prior art. Therefore, despite all efforts by the undersigned and the
Examiner to date, nothing of record indicates that anything beyond Exhibit
1 is prior art to the instant application.

18. I understand that willful and false statements and the like are
punishable by a fine and/or imprisonment under 18 U.S.C. § 1001, and
that such willful false statement may jeopardize the validity of this
application and any patent resulting therefrom.

Date: July 16, 2007

Intel/P16136 
US Application Serial No. 10/424,356 



EXHIBIT 1 
TO DECLARATION OF 
JAMES A FLIGHT 



Phase Tracking and Prediction 

Timothy Sherwood, Suleyman Sair and Brad Calder 

CS2002-0710 

June 23, 2002 

In a single second a modern processor can execute billions of instructions. Obtaining a bird's eye view 
of the behavior of a program at these speeds can be a difficult task when all that is available is cycle by 
cycle examination. In many programs, behavior is anything but steady state, and understanding the 
patterns of behavior at run-time can unlock a multitude of optimization opportunities. In this paper we 
present a unified profiling architecture that can efficiently capture, classify, and predict program 
behavior on the largest of time scales all at run-time with no support from software. By examining the 
proportion of instructions that were executed from different sections of code, we can find generic phases 
that correspond to changes in behavior across many metrics. By classifying phases generically, we avoid 
the need to identify phases for each optimization, and enable a unified prediction scheme that can 
forecast future behavior. We examine the ability of our phase tracking architecture to accurately capture 
the phase behavior of a program's execution with respect to its overall performance (IPC), branch 
prediction, cache performance, and energy, and show how phase behavior may be captured efficiently 
using a simple predictor. 



How to view this document 

• Display the whole document in one of the following formats. 

o PostScript 10012 bytes.

• Print or download all or selected pages. 



The authors of these documents have submitted their reports to this technical report series 
for the purpose of non-commercial dissemination of scientific work. The reports are 
copyrighted by the authors, and their existence in electronic format does not imply that the 
authors have relinquished any rights. You may copy a report for scholarly, non-commercial 
purposes, such as research or instruction, provided that you agree to respect the author's 
copyright. For information concerning the use of this document for other than research or 
instructional purposes, contact the authors. Other information concerning this technical 
report series can be obtained from the Computer Science and Engineering Department at the 
University of California at San Diego, techreports@cs.ucsd.edu. 



NCSTRL 



This server operates at UCSD Computer Science and Engineering.
Send email to webmaster@cs.ucsd.edu



http://www.cse.ucsd.edu/Dienst/UI/2.0/Descri...



7/16/2007 






EXHIBIT 2 
TO DECLARATION OF 
JAMES A FLIGHT 



Techreport 



To obtain a copy of this techreport, please 
look for it at the following site: 

http://www-cse.ucsd.edu/users/calder/papers.html

Or send email or a letter to: 

Brad Calder

University of California, San Diego 

9500 Gilman Drive 

La Jolla, CA 92093-0114 

calder@cs.ucsd.edu 






EXHIBIT 3 

TO DECLARATION OF 
JAMES A FLIGHT 



Publications 



The documents contained in these directories have been provided by the contributing authors as a means 
to ensure timely dissemination of scholarly and technical work on a noncommercial basis. Copyright 
and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that 
they have offered their works here electronically. It is understood that all persons copying this 
information will adhere to the terms and constraints invoked by each author's copyright. These works 
may not be reposted without the explicit permission of the copyright holder. 



• Thesis 



Publications are listed below by year of publication. For a list of papers by category, please click here.

Book Chapters

• Brad Calder, Timothy Sherwood, Greg Hamerly, and Erez Perelman, SimPoint: Picking
Representative Samples to Guide Simulation. Chapter 7 in the book "Performance Evaluation
and Benchmarking" edited by Lizy Kurian John and Lieven Eeckhout; published by CRC Press,
2005

2007 

• Satish Narayanasamy, Zhenghao Wang, Jordan Tigani, Andrew Edwards and Brad Calder.
Automatically Classifying Benign and Harmful Data Races Using Replay Analysis, ACM
SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2007
(pdf)

• Erez Perelman, Jeremy Lau, Harish Patil, Aamer Jaleel, Greg Hamerly, and Brad Calder. Cross
Binary Simulation Points. International Symposium on Performance Analysis of Systems and
Software (ISPASS), April 2007 (pdf)

• Satish Narayanasamy, Ayse Coskun, and Brad Calder. Transient Fault Prediction Based on
Anomalies in Processor Events, Design Automation and Test in Europe (DATE), April 2007
(pdf)

• Weifeng Zhang, Brad Calder and Dean Tullsen. Accelerating and Adapting Precomputation
Threads for Efficient Prefetching. In the International Symposium on High Performance
Computer Architecture, February 2007. (pdf)

• Wei Chuang, Satish Narayanasamy, and Brad Calder. Bounds Checking with Taint-Based
Analysis. International Conference on High Performance Embedded Architectures & Compilers,
January 2007 (pdf)

2006 



http://www-cse.ucsd.edu/users/calder/papers.html 



7/16/2007

• Jack Sampson, Ruben Gonzalez, Jean-Francois Collard, Norm Jouppi, Mike Schlansker and Brad
Calder, Exploiting Fine-Grained Data Parallelism with Chip Multiprocessors and Fast
Barriers. In proceedings of the International Symposium on Microarchitecture (MICRO),
December 2006. (pdf)

• Wei Chuang, Satish Narayanasamy, Ganesh Venkatesh, Jack Sampson, Michael Van Biesbrouck,
Gilles Pokam, Osvaldo Colavin, and Brad Calder. Unbounded Page-Based Transactional
Memory. International Conference on Architectural Support for Programming Languages and
Operating Systems (ASPLOS), Oct 2006 (pdf)

• Satish Narayanasamy, Cristiano Pereira and Brad Calder. Recording Shared Memory
Dependencies Using Strata, International Conference on Architectural Support for
Programming Languages and Operating Systems (ASPLOS), October 2006 (pdf)

• Satish Narayanasamy, Bruce Carneal and Brad Calder. Patching Processor Design Errors.
International Conference on Computer Design, October 2006 (pdf)

• Weifeng Zhang, Steve Checkoway, Brad Calder, and Dean Tullsen. Speculative Code Value
Specialization Using the Trace Cache Fill Unit. International Conference on Computer Design,
Oct 2006 (pdf)

• Satish Narayanasamy, Cristiano Pereira and Brad Calder. Software Profiling for Deterministic
Replay Debugging of User Code, The 5th International Conference on Software Methodologies,
Tools and Techniques (SoMeT), October 2006 (pdf)

• Michael Van Biesbrouck, Lieven Eeckhout, and Brad Calder, Efficient Sampling Startup for
SimPoint. IEEE Micro Special Issue Jul/Aug 06 Modeling & Simulation (pdf)

• Jeremy Lau, Matt Arnold, Micheal Hind, and Brad Calder, Online Performance Auditing:
Using Hot Optimizations Without Getting Burned. ACM SIGPLAN Conference on
Programming Language Design and Implementation (PLDI), June 2006 (pdf)

• Satish Narayanasamy, Cristiano Pereira, Harish Patil, Robert Cohn, and Brad Calder. Automatic
Logging of Operating System Effects to Guide Application Level Architecture Simulation.
ACM Sigmetrics International Conference on Measurement and Modeling of Computer Systems
(Sigmetrics), June 2006 (pdf)

• Greg Hamerly, Erez Perelman, Jeremy Lau, Brad Calder, and Timothy Sherwood. Using
Machine Learning to Guide Architecture Simulation. Journal of Machine Learning Research
(JMLR), 2006 (pdf)

• Erez Perelman, Marzia Polito, Jean-Yves Bouguet, John Sampson, Brad Calder, Carole Dulong,
Detecting Phases in Parallel Applications on Shared Memory Architectures. IEEE
International Parallel and Distributed Processing Symposium (IPDPS), April 2006. (pdf)

• Greg Hamerly, Erez Perelman, and Brad Calder, Comparing Multinomial and K-Means
Clustering for SimPoint. International Symposium on Performance Analysis of Systems and
Software (ISPASS), March 2006. (pdf)

• Michael Van Biesbrouck, Lieven Eeckhout, and Brad Calder, Considering All Starting Points
for Simultaneous Multithreading Simulation, International Symposium on Performance
Analysis of Systems and Software (ISPASS), March 2006. (pdf)

• Jeremy Lau, Erez Perelman, and Brad Calder, Selecting Software Phase Markers with Code
Structure Analysis. International Symposium on Code Generation and Optimization (CGO),
March 2006. (pdf)

• Weifeng Zhang, Brad Calder and Dean Tullsen, A Self Repairing Prefetcher in an Event-
Driven Dynamic Optimization Framework, International Symposium on Code Generation and
Optimization (CGO), March 2006. (pdf)

2005 

• Satish Narayanasamy, Gilles Pokam, and Brad Calder, BugNet: Recording Application Level
Execution for Deterministic Replay Debugging, IEEE Micro: Micro's Top Picks from
Computer Architecture Conferences, December 2005 (pdf)

• Michael Van Biesbrouck, Lieven Eeckhout, and Brad Calder, Efficient Sampling Startup for
Sampled Processor Simulation. 2005 International Conference on High Performance Embedded
Architectures & Compilers, November 2005. (pdf)

• Bengu Li, Ganesh Venkatesh, Brad Calder, and Rajiv Gupta, Exploiting a Computation Reuse
Cache to Reduce Energy in Network Processors. 2005 International Conference on High
Performance Embedded Architectures & Compilers, November 2005. (pdf)

• Lieven Eeckhout, John Sampson, and Brad Calder, Exploiting Program Microarchitecture
Independent Characteristics and Phase Behavior for Reduced Benchmark Suite Simulation.
2005 IEEE International Symposium on Workload Characterization, October 2005 (pdf)

• Cristiano Pereira, Jeremy Lau, Brad Calder, and Rajesh Gupta, Dynamic Phase Analysis for
Cycle-Close Trace Generation. International Conference on Hardware/Software Codesign and
System Synthesis, September 2005 (pdf)

• Weifeng Zhang, Brad Calder, and Dean Tullsen, An Event-Driven Multithreaded Dynamic
Optimization Framework, International Conference on Parallel Architectures and Compilation
Techniques, September 2005. (pdf)

• Erez Perelman, Trishul Chilimbi, and Brad Calder, Variational Path Profiling, International
Conference on Parallel Architectures and Compilation Techniques, September 2005. (pdf)

• Greg Hamerly, Erez Perelman, Jeremy Lau, and Brad Calder, SimPoint 3.0: Faster and More
Flexible Program Analysis. Journal of Instruction Level Parallelism, September 2005. (pdf)

• Brad Calder, Andrew Chien, Ju Wang and Don Yang, The Entropia Virtual Machine for
Desktop Grids. International Conference on Virtual Execution Environments, June 2005. (pdf)

• Satish Narayanasamy, Gilles Pokam, and Brad Calder, BugNet: Continuously Recording
Program Execution for Deterministic Replay Debugging. International Symposium on
Computer Architecture, June 2005. (pdf)

• Satish Narayanasamy, Hong Wang, Perry Wang, John Shen, Brad Calder, A Dependency Chain
Clustered Microarchitecture, IEEE International Parallel and Distributed Processing
Symposium, April 2005. (pdf)

• Jeremy Lau, Jack Sampson, Erez Perelman, Greg Hamerly, and Brad Calder, The Strong
Correlation between Code Signatures and Performance. IEEE International Symposium on
Performance Analysis of Systems and Software, March 2005. (pdf)

• Jeremy Lau, Erez Perelman, Greg Hamerly, Timothy Sherwood, and Brad Calder, Motivation for
Variable Length Intervals and Hierarchical Phase Behavior. IEEE International Symposium
on Performance Analysis of Systems and Software, March 2005. (pdf)

• Jeremy Lau, Stefan Schoenmackers, and Brad Calder, Transition Phase Classification and
Prediction. In the 11th International Symposium on High Performance Computer Architecture,
February 2005. (pdf)

2004

• Nathan Tuck, Brad Calder and George Varghese. Hardware and Binary Modification Support
for Code Pointer Protection From Buffer Overflow. 37th International Symposium on
Microarchitecture, December 2004. (pdf)

• Eric Tune, Rakesh Kumar, Dean Tullsen and Brad Calder. Balanced Multithreading:
Increasing Throughput via a Low Cost Multithreading Hierarchy, 37th International
Symposium on Microarchitecture, December 2004. (pdf)

• Timothy Sherwood, Mark Oskin, and Brad Calder. Balancing Design Options with Sherpa.
International Conference on Compilers, Architecture, and Synthesis for Embedded Systems
(CASES), September 2004, (pdf)

• Brad Calder, Todd Austin, Don Yang, Timothy Sherwood, Suleyman Sair, David Newquist and
Tim Cusac, BitRaker Anvil: Binary Instrumentation for Rapid Creation of Simulation and
Workload Analysis Tools, Proceedings of Global Signal Processing (GSPx), September, 2004,
(pdf)

• Greg Hamerly, Erez Perelman, and Brad Calder, How to Use SimPoint to Pick Simulation
Points, ACM SIGMETRICS Performance Evaluation Review, Volume 31(4), March 2004 (pdf)

• Glenn Reinman and Brad Calder, Using a Serial Cache for Energy Efficient Instruction
Fetching, Journal of Systems Architecture, 2004, (pdf)

• Jeremy Lau, Stefan Schoenmackers, and Brad Calder, Structures for Phase Classification, 2004
IEEE International Symposium on Performance Analysis of Systems and Software, March 2004
(pdf).

• Michael Van Biesbrouck, Timothy Sherwood, and Brad Calder, A Co-Phase Matrix to Guide
Simultaneous Multithreading Simulation, 2004 IEEE International Symposium on Performance
Analysis of Systems and Software, March 2004 (pdf)

• Nathan Tuck, Timothy Sherwood, Brad Calder, and George Varghese, Deterministic Memory-
Efficient String Matching Algorithms for Intrusion Detection. Proceedings of the IEEE
Infocom Conference, Hong Kong, China, March 2004. (pdf)

• Satish Narayanasamy, Yuanfang Hu, Suleyman Sair, and Brad Calder, Creating Converged
Trace Schedules Using String Matching. In the 10th International Symposium on High
Performance Computer Architecture, February 2004. (pdf)

2003

• Timothy Sherwood, Erez Perelman, Greg Hamerly, Suleyman Sair, and Brad Calder, Discovering
and Exploiting Program Phases, IEEE Micro: Micro's Top Picks from Computer Architecture
Conferences, December 2003 (pdf)

• Jeremy Lau, Stefan Schoenmackers, Timothy Sherwood, and Brad Calder, Reducing Code Size
With Echo Instructions. International Conference on Compilers, Architecture, and Synthesis for
Embedded Systems, October 2003. (pdf)

• Erez Perelman, Greg Hamerly, and Brad Calder, Picking Statistically Valid and Early
Simulation Points. International Conference on Parallel Architectures and Compilation
Techniques, September 2003. (pdf)

• Erez Perelman, Greg Hamerly, Michael Van Biesbrouck, Timothy Sherwood, and Brad Calder,
Using SimPoint for Accurate and Efficient Simulation. ACM SIGMETRICS the International
Conference on Measurement and Modeling of Computer Systems, June 2003. (pdf)

• Timothy Sherwood, George Varghese, and Brad Calder, A Pipelined Memory Architecture for
High Throughput Network Processors, 30th International Symposium on Computer
Architecture, June 2003. (pdf)

• Timothy Sherwood, Suleyman Sair, and Brad Calder, Phase Tracking and Prediction, 30th
International Symposium on Computer Architecture, June 2003. (pdf)

• Weihaw Chuang and Brad Calder, Predicate Prediction for Efficient Out-of-order Execution,
16th Annual ACM International Conference on Supercomputing, June 2003. (pdf)

• Weihaw Chuang, Brad Calder, and Jeanne Ferrante, Phi Predication for Light Weight
If-Conversion, International Symposium on Code Generation and Optimization, March 2003. (pdf)

• Suleyman Sair, Timothey Sherwood and Brad Calder, A Decoupled Predictor-Directed Stream
Prefetching Architecture. In the IEEE Transactions on Computers, Vol. 52, No. 5, March 2003

• Andrew Chien, Brad Calder, Stephen Elbert, and Karan Bhatia. Entropia: Architecture and
Performance of an Enterprise Desktop Grid System. Journal of Parallel Distributed
Computing, Vol 63, Issue 5, May 2003, pages 597-610. (pdf)

• Satish Narayanasamy, Timothy Sherwood, Suleyman Sair, Brad Calder, George Varghese,
Catching Accurate Profiles in Hardware. In the 9th International Symposium on High
Performance Computer Architecture, February 2003. (pdf)

• Beth Simon, Brad Calder, and Jeanne Ferrante, Incorporating Predicate Information Into
Branch Predictors. 9th International Symposium on High Performance Computer Architecture,
February 2003. (pdf)

2002

• Jamison Collins, Suleyman Sair, Brad Calder, and Dean Tullsen, Pointer-Cache Assisted
Prefetching. To appear in the 35th Annual International Symposium on Microarchitecture,
November 2002. (pdf)

• Timothy Sherwood, Erez Perelman, Greg Hamerly and Brad Calder, Automatically
Characterizing Large Scale Program Behavior. In the 10th International Conference on
Architectural Support for Programming Languages and Operating Systems, October 2002. (pdf)

• Eric Tune, Dean Tullsen, and Brad Calder, Quantifying Instruction Criticality, in the Eleventh
International Conference on Parallel Architectures and Compilation Techniques, September 2002.
(pdf)

• Lori Carter and Brad Calder, Using Predicate Path Information in Hardware to Determine
True Dependences. In the proceedings of the 16th Annual ACM International Conference on
Supercomputing, June 2002. (pdf)

• Lori Carter, Weihaw Chuang, and Brad Calder, An EPIC Processor with Pending Functional
Units. In the proceedings of the 4th International Symposium on High Performance Computing
(ISHPC2K), May 2002, (c) Springer-Verlag. (pdf)

• Suleyman Sair, Timothy Sherwood and Brad Calder, Quantifying Load Stream Behavior. 8th
International Symposium on High-Performance Computer Architecture, February 2002. (pdf).

2001

• Timothy Sherwood and Brad Calder, Patchable Instruction ROM Architecture. International
Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES),
November 2001. (pdf)

• Timothy Sherwood, Erez Perelman and Brad Calder, Basic Block Distribution Analysis to Find
Periodic Behavior and Simulation Points in Applications, International Conference on Parallel
Architectures and Compilation Techniques, September 2001. (pdf)

• Chandra Krintz and Brad Calder, Reducing Delay with Dynamic Selection of Compression
Formats. Tenth International Symposium on High Performance Distributed Computing, August
2001. (pdf)

• Timothy Sherwood and Brad Calder, Automated Design of Finite State Machine Predictors for
Customized Processors. 28th International Symposium on Computer Architecture, June 2001.
(pdf)

• Chandra Krintz and Brad Calder, Using Annotations to Reduce Dynamic Optimization Time,
ACM SIGPLAN 2001 Conference on Programming Language Design and Implementation
(PLDI), June 2001. (pdf)

• Glenn Reinman, Brad Calder, and Todd Austin, Optimizations Enabled by a Decoupled Front-
End Architecture. IEEE Transactions on Computers, Vol. 50, No. 4, April 2001. (pdf)

• Chandra Krintz, David Grove, Vivek Sarkar, and Brad Calder, Reducing the Overhead of
Dynamic Compilation. Software: Practice and Experience, pages 717-738, Volume 31, Issue 8,
March 2001. (pdf)

• Eric Tune, Dongning Liang, Dean M. Tullsen, and Brad Calder, Dynamic Prediction of the
Critical Dependence Path. 7th International Symposium On High Performance Computer
Architecture, January 2001. (pdf)

2000

• Timothy Sherwood, Suleyman Sair, and Brad Calder, Predictor-Directed Stream Buffers. In
proceedings of the 33rd International Symposium on Microarchitecture, December 2000. (pdf)

• Lori Carter, Beth Simon, Brad Calder, Larry Carter, and Jeanne Ferrante, Path Analysis and
Renaming for Predicated Instruction Scheduling, International Journal of Parallel
Programming, pages 563-588, December 2000. (pdf)

• Timothy Sherwood and Brad Calder, Loop Termination Prediction. In the proceedings of the
3rd International Symposium on High Performance Computing (ISHPC2K), October 2000, (c)
Springer-Verlag. (pdf)

• Barbara Kreaseck, Dean Tullsen, and Brad Calder, Limits of Task-based Parallelism in
Irregular Applications. In the proceedings of the 3rd International Symposium on High
Performance Computing (ISHPC2K), October 2000, (c) Springer-Verlag. (pdf)

• Timothy Sherwood and Brad Calder, ToolBlocks: An Infrastructure for the Construction of
Memory Hierarchy Analysis Tools. In the proceedings of the International European
Conference on Parallel Computing (EURO-PAR), August 2000. (pdf)

• Timothy Sherwood and Brad Calder. Automated Design of Finite State Machine Predictors
for Value Prediction and Branch Prediction. UCSD Techreport, CS2000-0656, June 2000
(pdf)

• Brad Calder and Glenn Reinman, A Comparative Survey of Load Speculation Architectures,
Journal of Instruction Level Parallelism, May 2000. (pdf)

1999

• Glenn Reinman, Brad Calder, and Todd Austin, Fetch Directed Instruction Prefetching. In
proceedings of the 32nd International Symposium on Microarchitecture, November 1999. (pdf)

• Chandra Krintz, Brad Calder, and Urs Holzle, Reducing Transfer Delay Using Java Class File
Splitting and Prefetching. In proceedings of the 14th Annual ACM SIGPLAN Conference on
Object-Oriented Programming Systems, Languages, and Applications, November 1999. (pdf)

• Lori Carter, Beth Simon, Brad Calder, Larry Carter, and Jeanne Ferrante, Predicated Static 
Single Assignment. In proceedings of the International Conference on Parallel Architectures and
Compilation Techniques, October 1999. fpdf) 

• Timothy Sherwood and Brad Calder, Time Varying Behavior of Programs. UC San Diego
Technical Report UCSD-CS99-630, August 1999. (pdf)

• Timothy Sherwood, Brad Calder, and Joel Emer, Reducing Cache Misses Using Hardware and
Software Page Placement. In the ACM International Conference on Supercomputing, pages
155-164, June 1999. (pdf)

• Glenn Reinman, Brad Calder, Dean Tullsen, Gary Tyson, and Todd Austin, Classifying Load
and Store Instructions for Memory Renaming. In the ACM International Conference on
Supercomputing, pages 399-407, June 1999. (pdf)

• Glenn Reinman, Todd Austin, and Brad Calder, A Scalable Front-End Architecture for Fast
Instruction Delivery. 26th International Symposium on Computer Architecture, pages 234-245,
May 1999. (pdf)

• Brad Calder, Glenn Reinman, and Dean Tullsen, Selective Value Prediction, 26th International
Symposium on Computer Architecture, pages 64-73, May 1999. (pdf)

• Brad Calder, Peter Feller, and Alan Eustace, Value Profiling and Optimization, Journal of
Instruction Level Parallelism, March 1999. (pdf)

• Steven Wallace, Dean Tullsen, and Brad Calder, Instruction Recycling on a Multiple-Path
Processor. 5th International Symposium On High Performance Computer Architecture, pages
44-53, January 1999. (pdf)

• Brad Calder and Dirk Grunwald, The Precomputed Branch Architecture. Journal of Systems
Architecture, pages 651-679, Vol. 45, 1999. (pdf)

1998

• Glenn Reinman and Brad Calder, Predictive Techniques for Aggressive Load Speculation.
31st International Symposium on Microarchitecture, pages 127-137, December 1998. (pdf)

• Brad Calder, Chandra Krintz, Simmi John, and Todd Austin, Cache-Conscious Data Placement.
8th International Conference on Architectural Support for Programming Languages and Operating
Systems, San Jose, California, pages 139-149, October 1998. (pdf)

• Chandra Krintz, Brad Calder, Han Bok Lee, and Benjamin G. Zorn, Overlapping Execution with
Transfer Using Non-Strict Execution for Mobile Programs. 8th International Conference on
Architectural Support for Programming Languages and Operating Systems, San Jose, California,
pages 159-169, October 1998. (pdf)

• Artur Klauser, Todd Austin, Dirk Grunwald, and Brad Calder, Dynamic Hammock Predication
for Non-predicated Instruction Set Architectures. International Conference on Parallel
Architectures and Compilation Techniques, Paris, France, pages 278-285, October 1998. (pdf)

• Dean Tullsen and Brad Calder, Computing Along the Critical Path. UC San Diego Technical
Report, October 1998. (pdf)






• Iris Bahar, Brad Calder and Dirk Grunwald, A Comparison of Software Code Reordering and
Victim Buffers. Third Workshop on Interaction Between Compilers and Computer Architectures,
October 1998. (pdf)

• Steven Wallace, Brad Calder, and Dean Tullsen, Threaded Multiple Path Execution. 25th
International Symposium on Computer Architecture, pages 238-249, June 1998. (pdf)

1997

• Brad Calder, Peter Feller, and Alan Eustace, Value Profiling. 30th International Symposium on
Microarchitecture, pages 259-269, December 1997. (pdf)

• Nikolas Gloy, Trevor Blackwell, Michael D. Smith, and Brad Calder, Procedure Placement
Using Temporal Ordering Information. 30th International Symposium on Microarchitecture,
pages 303-313, December 1997. (pdf)

• Amir H. Hashemi, David R. Kaeli, and Brad Calder, Efficient Procedure Mapping Using Cache
Line Coloring. Proceedings of the SIGPLAN Conference on Programming Language Design and
Implementation, pages 171-182, June 1997. (pdf)

• Amir H. Hashemi, David R. Kaeli, and Brad Calder, Procedure Mapping Using Static Call
Graph Estimation, Workshop on the Interaction between Compilers and Computer
Architectures, San Antonio, Texas, February 1997. (pdf)

• Brad Calder, Dirk Grunwald, Donald Lindsay, Michael Jones, James Martin, Michael Mozer, and
Benjamin Zorn, Evidence-based Static Branch Prediction using Machine Learning. ACM
Transactions on Programming Languages and Systems, pages 188-222, Vol. 19, No. 1, January
1997. (pdf)

1996

• Brad Calder, Dirk Grunwald, and Joel Emer, Predictive Sequential Associative Cache. 2nd
International Symposium on High Performance Computer Architecture, pages 244-253, February
1996. (pdf)

1995

• Brad Calder, Dirk Grunwald, and Amitabh Srivastava, The Predictability of Branches in
Libraries. 28th International Symposium on Microarchitecture, pages 24-34, November 1995,
WRL Research Report 95/6 version, (pdf)

• Brad Calder, Dirk Grunwald, and Joel Emer, A System Level Perspective on Branch
Architecture Performance, 28th International Symposium on Microarchitecture, pages 199-206,
November 1995. (pdf)

• Brad Calder and Dirk Grunwald, Next Cache Line and Set Prediction. 22nd International
Symposium on Computer Architecture, pages 287-296, June 1995. (pdf)

• Dennis Lee, Jean-Loup Baer, Brad Calder, and Dirk Grunwald, Instruction Cache Fetch Policies
for Speculative Execution. 22nd International Symposium on Computer Architecture, pages
357-367, June 1995. (pdf)

• Brad Calder, Dirk Grunwald, Donald Lindsay, James Martin, Michael Mozer, and Benjamin Zorn,
Corpus-based Static Branch Prediction. Proceedings of the SIGPLAN Conference on
Programming Language Design and Implementation, pages 79-92, June 1995. (pdf)

1994

• Brad Calder, Dirk Grunwald and Benjamin Zorn, Quantifying Behavioral Differences Between
C and C++ Programs. Journal of Programming Languages, pages 313-351, Vol 2, Num 4, 1994.
(pdf)

• Brad Calder and Dirk Grunwald, Reducing Branch Costs via Branch Alignment. 6th
International Conference on Architectural Support for Programming Languages and Operating
Systems, pages 242-251, October 1994. (pdf)

• Brad Calder and Dirk Grunwald, Fast and Accurate Instruction Fetch and Branch Prediction.
21st International Symposium on Computer Architecture, pages 2-11, April 1994. (pdf)

• Brad Calder and Dirk Grunwald, Reducing Indirect Function Call Overhead In C++
Programs. 21st Symposium on Principles of Programming Languages, pages 397-408, January
1994. (pdf)

1993

• Dave Wagner and Brad Calder, Leapfrogging: A Portable Technique for Implementing
Efficient Futures. 4th ACM Principles and Practice of Parallel Processing, pages 208-217, May
1993. (pdf)



Thesis

Peter Feller, M.S., Value Profiling for Instructions and Memory Locations. April 1998.

Chandra Krintz, Ph.D., Reducing Load Delay to Improve Performance of Internet-Computing Programs.
May 2001.

Glenn Reinman, Ph.D., Hardware Optimizations Enabled by a Decoupled Fetch Architecture. June 2001.

Beth Simon, Ph.D., Turning Predicate Information to Advantage to Improve Compiler Scheduling and
Branch Prediction. December 2001.

Lori Carter, Ph.D., Compiler and Hardware Predicated Dependency Analysis and Scheduling. February
2002.

Timothy Sherwood, Ph.D., Application-Tuned Processor Architectures. June 2003.

Suleyman Sair, Ph.D., Predictor-Directed Data Prefetching for Pointer-based Applications. June 2003.




Brad Calder, Hardware and Software Mechanisms for Instruction Fetch Prediction, Dissertation 
University of Colorado Technical Report CU-CS-781-95, August 1995. 






Intel/P16136 
US Application Serial No. 10/424,356 



EXHIBIT 4 
TO DECLARATION OF 
JAMES A FLIGHT 



In Proceedings of 30th International Symposium on Computer Architecture (ISCA), June 2003.



Phase Tracking and Prediction 

Timothy Sherwood Suleyman Sair Brad Calder 

Department of Computer Science and Engineering 
University of California, San Diego 
{sherwood,ssair,calder}@cs.ucsd.edu



Abstract 

In a single second a modern processor can execute billions 
of instructions. Obtaining a bird's eye view of the behavior of a 
program at these speeds can be a difficult task when all that is 
available is cycle by cycle examination. In many programs, be- 
havior is anything but steady state, and understanding the pat- 
terns of behavior, at run-time, can unlock a multitude of opti- 
mization opportunities. 

In this paper, we present a unified profiling architecture that 
can efficiently capture, classify, and predict phase-based pro- 
gram behavior on the largest of time scales. By examining the 
proportion of instructions that were executed from different sec- 
tions of code, we can find generic phases that correspond to 
changes in behavior across many metrics. By classifying phases 
generically, we avoid the need to identify phases for each optimization,
and enable a unified prediction scheme that can forecast
future behavior. Our analysis shows that our design can
capture phases that account for over 80% of execution using less
than 500 bytes of on-chip memory.

1 Introduction 

Modern processors can execute upwards of 5 billion instructions 
in a single second, yet most architectural features target program 
behavior on a time scale of hundreds to thousands of instructions,
less than half a μs. While these optimizations can provide
large benefits, they are limited in their ability to see the program
behavior in a larger context. 

Recently there has been a renewed interest in examin- 
ing the run-time behavior of programs over longer periods of 
time [10, 11, 19, 20, 3]. It has been shown that programs can 
have considerably different behavior depending on which portion
of execution is examined. More specifically, it has been
shown that many programs execute as a series of phases, where 
each phase may be very different from the others, while still hav- 
ing a fairly homogeneous behavior within a phase. Taking ad- 
vantage of this time varying behavior can lead to, among other 
things, improved power management, cache control, and more 
efficient simulation. The primary goal of this research is the de- 
velopment of a unified run-time phase detection and prediction 
mechanism that can be used to guide any optimization seeking 
to exploit large scale program behavior. 

A phase of program behavior can be defined in several ways. 
Past definitions are built around the idea of a phase being an in- 
terval of execution during which a measured program metric is 
relatively stable. We extend this notion of a phase to include all 
similar sections of execution regardless of temporal adjacency. 
Simply put, if a phase of execution is correctly identified, there
should only be small variations between any two execution intervals
identified as being part of the same phase. A key point of
this paper is that the phase behavior seen in any program metric 
is directly a function of the way the code is being executed. If 
we can accurately capture this behavior at run-time through the 
computation of a single metric, we can use this to guide many 
optimization and policy decisions without duplicating phase de- 
tection mechanisms for each optimization. 

In this paper, we present an efficient run-time phase tracking 
architecture that is based on detecting changes in the propor- 
tions of the code being executed. In addition, we present a novel 
phase prediction architecture that can predict not only when a
phase change is about to occur, but also the phase to which it
will transition. Since our phase tracking implementation is
based upon code execution frequencies, it is independent of any 
individual architecture metric. This allows our phase tracker to 
be used as a general profiling technique building up a profile or 
database of architecture information on a per phase basis to be 
used later for hardware or software optimization. Independence 
from architecture metrics allows us to consistently track phase 
information as the program's behavior changes due to phase- 
based optimizations. 

We demonstrate the effectiveness of our hardware based 
phase detection and classification architecture at automatically 
partitioning the behavior of the program into homogeneous 
phases of execution and to identify phase changes. We show 
that the changes in many important metrics, such as IPC and en- 
ergy, correlate very closely with the phase changes found by our 
metric. We then evaluate the effectiveness of phase tracking and 
prediction for value profiling, data cache reconfiguration, and 
re-configuring the width of the processor. 

The rest of the paper is laid out as follows. In Section 2, 
prior work related to phase-based program behavior is discussed. 
Simulation methodology and benchmark descriptions can be 
found in Section 3. Section 4 describes our phase tracking ar- 
chitecture. The design and evaluation of the phase predictor are 
found in Section 5. Section 6 presents several potential applica- 
tions of our phase tracking architecture. Finally, the results are 
summarized in Section 7. 

2 Related Work 

In this Section we describe work related to phase identification 
and phase-based optimization. 

In [19], we provided an initial study into the time varying 
behavior of programs, showing that programs have repeatable 
phase-based behavior over many hardware metrics (cache behavior,
branch prediction, value prediction, address prediction,
IPC and RUU occupancy) for all the SPEC 95 programs. Looking
at these metrics over time, we found that many programs have 
repeating patterns, and that important metrics tend to change at 
the same time. These places represent phase boundaries. 

In [20], we proposed that by profiling only the code that was 
executed over time we could automatically identify periodic and 
phase behavior in programs. The goal was to automatically find 
the repeating patterns observed in [19], and the lengths (peri- 
ods) of these patterns. We then extended this work in [21], using 
techniques from machine learning to break the complete exe- 
cution of the program into phases (clusters) by only tracking the 
code executed. We found that intervals of execution grouped into 
the same phase had similar behavior across all the architecture 
metrics examined. From this analysis, we created a tool called 
SimPoint [21], which automatically identifies a small set of in- 
tervals of execution (simulation points) in a program to perform 
architecture simulations. These simulation points provide an ac- 
curate and efficient representation of the complete execution of 
the program. 

The work of Dhodapkar and Smith [10, 9] is the most closely 
related to ours. They found a relationship between phases and 
instruction working sets, and that phase changes occur when the 
working set changes. They propose that by detecting phases and 
phase changes, multi-configuration units can be re-configured in 
response to these phase changes. They have used their working 
set analysis for instruction cache, data cache and branch predictor
re-configuration to save energy [10, 9].

The work we present in this paper identifies phases and phase 
changes by keeping track of the proportions in which the code 
was executed during an interval based upon the profiler used 
in [20]. In comparison, Dhodapkar and Smith [10, 9] track the 
phase and phase changes solely upon what code was executed 
(working set), without weighting the code by its frequency of 
execution. Future research is needed to compare these two ap- 
proaches. 

Additional differences between our work and theirs include our
examination of architectures for predicting phase changes, and uses
that differ from those in [10, 9], such as value profiling and processor width
reconfiguration. We provide an architecture that can fairly accurately
predict what the next phase will be, along with predicting
when there will be a phase change. In comparison, Dhodapkar 
and Smith do not examine phase-based prediction [10, 9], but 
concentrate on detecting when the working set size changes, and 
then reactively apply optimization. 

Merten et al. [15] developed a run-time system for dynami- 
cally optimizing frequently executed code. Then in [3], Barnes 
et al. extend this idea to perform phase-directed compiler optimizations.
The main idea is the creation of optimized code
"packages" that are targeted towards a given phase, with the goal 
of execution staying within the package for that phase. Barnes et 
al. concentrate primarily on the compiler techniques needed to 
make phase-directed compiler optimizations a reality, and do not 
examine the mechanics of hardware phase detection and classi- 
fication. We believe that using the techniques in [3] in conjunc- 
tion with our phase classification and prediction architecture will 
provide a powerful run-time execution environment. 



I Cache       [illegible] 4-way set-associative, [illegible] byte blocks, [illegible] cycle latency
D Cache       16k 4-way set-associative, 32 byte blocks, 1 cycle latency
L2 Cache      128K 8-way set-associative, 64 byte blocks, 12 cycle latency
Main Memory   120 cycle latency
Branch Pred   [illegible] modal predictor
O-O-O Issue   out-of-order issue of up to 4 operations per cycle, 64 entry re-order buffer
Mem Disambig  load/store queue, loads may execute when all prior store addresses are known
Registers     32 integer, 32 floating point
Func Units    2-integer ALU, 2-load/store units, 1-FP adder, 1-integer MULT/DIV, 1-FP MULT/DIV
Virtual Mem   8K byte pages, 30 cycle fixed TLB miss latency after earlier-issued instructions complete

Table 1: Baseline Simulation Model.



3 Methodology 

To perform our study, we collected information for ten
SPEC 2000 programs (applu, apsi, art, bzip, facerec,
galgel, gcc, gzip, mcf, and vpr), all with reference inputs.
All programs were executed from start to completion using Sim- 
pleScalar [5] and Wattch [4]. Because of the lengthy simulation 
time incurred by executing all of the programs to completion, 
we chose to focus on only 10 programs. We chose the above 
10 programs since their phase based behavior represents a rea- 
sonable snapshot of the SPEC 2000 benchmark suite, along with 
picking some of the programs that showed the most interesting 
phase-based behavior. Each program was compiled on a DEC
Alpha AXP-21164 processor using the DEC C and FORTRAN
compilers. The programs were built under the OSF/1 V4.0 operating
system using full compiler optimization (-O4 -ifo).

The timing simulator used was derived from the Sim- 
pleScalar 3.0 tool set [5], a suite of functional and timing simu- 
lation tools for the Alpha AXP ISA. The baseline microarchitecture
model is detailed in Table 1. In addition to this, we wanted
to examine energy usage optimizations, so we used a version of 
Wattch [4] to capture this information. We modified all of these 
tools to log and reset the statistics every 10 million instructions, 
and we use this as a base for evaluation. 

4 Phase Capture 

In this section we motivate the occurrence of phase-based behav- 
ior, describe our architecture for capturing it, and examine the 
accuracy of using the program behavior in our phase-tracking 
architecture to identify phase changes for various hardware met- 
rics. 

4.1 Phase-Based Behavior 

The goal of this research is to design an efficient and general pur- 
pose technique for capturing and predicting the run-time phase 
behavior of programs for the purpose of guiding any optimiza- 
tion seeking to exploit large scale program behavior. Figure 1 
helps to motivate our approach to the problem. This figure shows
the behavior of two programs, gcc and gzip, as measured by
various different statistics over the course of their execution from
start to finish. Each point on the graph is taken over 10 million
instructions worth of execution. The metrics shown are the
number of unified L2 cache misses (ul2), the energy consumed
by the execution of the instructions, the number of instruction
cache (il1) misses, the number of data cache misses (dl1), the
number of branch mispredictions (bpred) and the average IPC.
The results show that all of the metrics tend to change in unison,
although not necessarily in the same direction. In addition to
this, patterns of recurring behavior can be seen over very large
time scales.

Figure 1: To illustrate the point that phase changes happen across many metrics all at the same time, we have plotted the value
of these metrics over billions of instructions executed for the programs gcc (shown left) and gzip (shown right). Each point on
the graph is an average over 10 million instructions. The number of unified L2 cache misses (ul2), the energy consumed by the
execution of the instructions, the number of instruction cache (il1) misses, the number of data cache misses (dl1), the number of
branch mispredictions (bpred) and the average IPC are plotted.

As can be seen from these graphs, even at a granularity of 10 
million instructions (which is at the same time scale as operating 
system time slices) there can be wildly different behavior seen 
between intervals. In this paper, we concentrate on a granularity 
of 10 million instructions because it is both outside the scope 
of normal architectural timing and is small enough to allow for 
many complex phase behaviors to be seen. 

4.2 Tracking Phases by Executed Code

Our phase tracker architecture operates at two different time 
scales. It gathers profile information very quickly in order to 
keep up with processor speeds, while at the same time it com- 
pares any data it gathers with information collected over the long 
term. Additionally, it must be able to do all that while still being 
reasonable in size. 

Our phase profile generation architecture can be seen in Fig- 
ure 2. The key idea is to capture basic block information during 
execution, while not relying on any compiler support. Larger 
basic blocks need to be weighted more heavily as they account
for a more significant portion of the execution. To approximate 
gathering basic block information, we capture branch PCs and 
the number of instructions executed between branches. The input
to the architecture is a tuple of information: a branch identifier
(PC) and the number of instructions since the last branch
PC was executed. This allows us to roughly capture each basic
block executed along with the weight of the basic block in terms 
of the number of instructions executed, as we did in [20, 21] for 
identifying simulation points. 
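
The input tuples described above can be sketched in a few lines of Python. This is only an illustration, not the hardware design: the trace format (a list of `(pc, is_branch)` pairs), the function name `branch_tuples`, and the choice to count the branch instruction itself as part of its block are all assumptions made for the example.

```python
def branch_tuples(trace):
    """Yield (branch_pc, instructions_since_last_branch) for each branch.

    `trace` is a hypothetical iterable of (pc, is_branch) pairs, one per
    executed instruction. The count includes the branch itself (an
    assumption), so each tuple approximates one basic block and its weight.
    """
    count = 0
    for pc, is_branch in trace:
        count += 1
        if is_branch:
            yield pc, count
            count = 0


# Two executions of a 3-instruction block ending in the branch at 0x108,
# then one 2-instruction block ending in the branch at 0x204.
trace = [
    (0x100, False), (0x104, False), (0x108, True),
    (0x100, False), (0x104, False), (0x108, True),
    (0x200, False), (0x204, True),
]
tuples = list(branch_tuples(trace))
```

Each tuple carries both the identity of the block (its terminating branch PC) and its weight, which is what allows larger blocks to count for more.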

Classifying phases by examining only the code that is ex- 
ecuted allows our phase tracker to be independent of any in- 
dividual architecture metric. This allows our phase tracker to 
be used as a general profiling technique building up a profile or 
database of architecture information on a per phase basis to be 
used later for hardware or software optimization. Independence 
from architecture metrics is also very important to allow us to 
consistently track phase information as the program's behavior 
changes due to phase-based optimizations. 

At this point it is worth making more explicit the differences 
between our technique and that of Dhodapkar and Smith [10, 9].
Dhodapkar and Smith use a bit vector to track the working set of
the code for a particular interval, while our technique is based
on the basic block vectors used in [20]. The bit vectors of Dhodapkar
and Smith track a metric that is related to which code
blocks were touched, whereas our metric tracks the proportion 
of time spent executing in each code block. This is a subtle but 
important distinction. We have found that in complex programs 
(such as gcc and gzip) there are many instruction blocks that
execute only intermittently. When tracking the pure working set, 
these infrequently executed blocks can disguise the frequently 
executed blocks that dominate the behavior of the application. 
On the other hand, by tracking the frequency of code execution 
it is possible to distinguish important instructions (basic blocks) 
from a sea of infrequently executed ones. Examining these dif- 
ferences in more detail is a topic of future research. 
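
The distinction between "which code was touched" and "how much time was spent in each block" can be illustrated with a small Python sketch. It is a simplification under stated assumptions: the working set is modelled as an exact set of branch PCs rather than a hashed bit vector, and the two intervals are made-up data.

```python
def working_set(tuples):
    """Set of branch PCs touched in the interval (unweighted)."""
    return {pc for pc, _ in tuples}


def weighted_profile(tuples):
    """Fraction of executed instructions attributed to each branch PC."""
    totals = {}
    for pc, n in tuples:
        totals[pc] = totals.get(pc, 0) + n
    executed = sum(totals.values())
    return {pc: n / executed for pc, n in totals.items()}


# Hypothetical interval 1: the block ending at 0xA0 dominates, 0xB0 runs once.
interval_1 = [(0xA0, 10)] * 99 + [(0xB0, 10)]
# Hypothetical interval 2: the same two blocks, but 0xB0 now dominates.
interval_2 = [(0xA0, 10)] + [(0xB0, 10)] * 99

same_working_set = working_set(interval_1) == working_set(interval_2)
p1 = weighted_profile(interval_1)
p2 = weighted_profile(interval_2)
```

In this example the two intervals have identical working sets, yet their weighted profiles are nearly opposite, so only the frequency-based view separates them.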

Figure 2: Our phase classification architecture. Each branch PC
is captured along with the number of instructions from the last
branch. The bucket entry corresponding to a hash of the branch
PC is incremented by the number of instructions. After each
profiling interval has completed, this information is classified,
and if it is found to be unique enough, stored in the past footprint
table along with its phase ID.

Another advantage of tracking the proportions in which the
basic blocks are executed is that we can use this information to
identify not only when different sections of code are executing,
but also when those sections of code are being exercised differ- 
ently. A simple example is in a graphics manipulation program 
running a parameterized filter on an input image. If you run a 
simple 3x3 blur filter on an image you get very different behavior 
than if you run a 7x7 filter on the same image despite the fact that 
the same filter code is executing. The 7x7 filter will have many 
more memory references and those memory references conflict 
very differently in the cache than in the 3x3 case. We have seen 
this very behavior in examining the interactive graphics program 
xv. Using the proportion of execution for each basic block can 
distinguish these differences, because in the 3x3 filter the head 
of the loop is called more than twice as frequently as in the 7x7 
filter. 

The same general idea applies to other data structures as 
well. Take for example a linked list. As the number of nodes in 
the linked list traversal changes over different loop invocations, 
the number of instructions executed inside the loop versus the 
time spent outside the loop also changes. This behavior can be 
captured when including a measure of the proportion of the code 
executed, and this can distinguish between linked list traversals of
different lengths. 

4.3 Capturing the Code Profile 

To index into the accumulator table in Figure 2, the branch PC 
is reduced to a number from 1 to Nbuckets using a hash func- 
tion. We have found that 32 buckets is sufficient to distinguish 
between different phases even for some of the more complex 
programs such as gcc. A counter is kept for each bucket, and 
the counter is incremented by the number of instructions from 
the last branch to the current branch being processed. Each ac- 
cumulator table entry is a large (in this study 24-bit), saturating 
counter, which will not saturate during our profiling interval of 
10 million instructions. Updating the accumulator table is the 
only operation that needs to be performed at a rate equivalent to
the processor's execution of the program (once for every branch
executed). In comparison, the phase classification described be- 
low needs to only be performed once every 10 million instruc- 
tions (at the end of each interval), and thus is not nearly as per- 
formance critical. 
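
A minimal software sketch of the accumulator table follows. The text does not specify the hash function, so `bucket_index` (drop the two low PC bits, then take the remainder) is a hypothetical stand-in; the sketch also uses 0-based bucket indices rather than the 1 to Nbuckets numbering above. The 32 buckets and 24-bit saturating counters are taken from the text.

```python
N_BUCKETS = 32
COUNTER_MAX = (1 << 24) - 1  # largest value of a 24-bit saturating counter


def bucket_index(pc, n_buckets=N_BUCKETS):
    """Hypothetical hash: drop the 2 low bits of the PC, then take the
    remainder. The hardware hash is not specified in the text."""
    return (pc >> 2) % n_buckets


class Accumulator:
    """One counter per bucket, updated once per executed branch."""

    def __init__(self, n_buckets=N_BUCKETS):
        self.counters = [0] * n_buckets

    def update(self, branch_pc, n_instructions):
        i = bucket_index(branch_pc, len(self.counters))
        # Saturate instead of wrapping around.
        self.counters[i] = min(self.counters[i] + n_instructions, COUNTER_MAX)

    def reset(self):
        """Clear the table at the end of a profiling interval."""
        self.counters = [0] * len(self.counters)


acc = Accumulator()
for pc, n in [(0x108, 3), (0x108, 3), (0x204, 2)]:
    acc.update(pc, n)
```

Because a 24-bit counter holds values up to 16,777,215, even an interval of 10 million instructions that lands entirely in one bucket does not reach the saturation limit.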

We note that the hashing function we use is fundamentally 
the same as the random projection method we used to generate 
phases in [21]. In this prior work, we make use of random pro- 
jections of the data to reduce the dimensionality of the samples 
being taken. A random projection takes trace data in the form of 
a matrix of size L x B, where L is the length of the trace and B is
the number of unique basic blocks, and multiplies it by a random
matrix of size B x N, where N is the desired dimensionality of
the data which is much smaller than B. This creates a new matrix
of size L x N, which has clustering properties very similar
to the original data. The random projection method is a powerful 
technique when used with clustering algorithms, and for captur- 
ing phase behavior as we showed in [21]. The hashing scheme 
we use in this paper is essentially a degenerate form of random 
projection that makes a hardware implementation feasible while 
still having low error. If all the elements of the random projection
matrix consist of either a 0 or a 1, and they are placed such
that no column of the matrix contains more than a single 1, then 
the random projection is identical to this simple hashing mech- 
anism. We have designed our phase classification architecture 
around this principle. 
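
The equivalence between hashing and a 0/1 projection can be checked on a toy example. In the sketch below, which is an illustration under assumptions rather than the authors' code, the projection matrix is stored transposed (N x B) and applied to one interval's weight vector, so that each basic block corresponds to a column holding a single 1. The block weights and the block-to-bucket assignment are made up.

```python
def hash_accumulate(weights, bucket_of, n_buckets):
    """Add each basic block's weight to the bucket it hashes to."""
    out = [0] * n_buckets
    for block, w in enumerate(weights):
        out[bucket_of[block]] += w
    return out


def projection_matrix(bucket_of, n_buckets):
    """N x B matrix of 0s and 1s with a single 1 in each column."""
    n_blocks = len(bucket_of)
    return [[1 if bucket_of[b] == row else 0 for b in range(n_blocks)]
            for row in range(n_buckets)]


def project(matrix, weights):
    """Matrix-vector product: one output entry per bucket."""
    return [sum(m * w for m, w in zip(row, weights)) for row in matrix]


# One interval of a hypothetical trace: instruction counts for B = 6 blocks.
weights = [40, 0, 25, 5, 20, 10]
bucket_of = [0, 2, 1, 0, 2, 1]  # arbitrary block-to-bucket assignment, N = 3

hashed = hash_accumulate(weights, bucket_of, 3)
projected = project(projection_matrix(bucket_of, 3), weights)
```

Both routes produce the same reduced vector; the hashed form simply never materializes the matrix.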

Figure 3 shows the effect of applying the above mentioned 
technique for capturing the phase behavior of the integer bench- 
mark gzip. The x-axis of the figure is in billions of instructions, 
as is the case in Figure 1. Each point on the y-axis represents an 
entry of the phase tracker's accumulator table. Each point on the 
graph corresponds to the value of the corresponding accumulator 
table entry at the end of a profiling interval. Dark values repre- 
sent high execution frequency, while light values correspond to 
low frequency. The same trends that were seen in Figure 1 for 
gzip can be clearly seen in Figure 3. In both of these figures, 
when observing them at the coarsest granularity, we can see that 
there are at least three different phases labeled A, B and C. In 
Figure 3, the phase tracker table entries 2, 5, 7, 13 and
17 distinguish the two identical long running phases labeled A
from a group of three long running phases labeled C. Phase table
entries 12 and 20 clearly distinguish phase B from both A and
C. This figure is pictorial evidence that the phase tracker is able 
to break the program's execution into the corresponding phases 
based solely on the executed code, and that these phases corre- 
spond to the behavior seen across the different program metrics 
in Figure 1. 

4.4 Forming a Footprint 

After the profiling interval has elapsed, and branch block infor- 
mation has been accumulated, the phase must then be classified. 
To do this we keep a history of past phase information. 

If we fix the number of instructions for a profiling interval, 
then we can divide each bucket by this fixed number to get the 
percentage of execution that was accounted for by all instructions
mapped to that bucket. However, we do not need to know
the exact percentages for each bucket. Instead of keeping the
full counter values, we can instead compress phase information
down to a couple of the most significant bits. This compressed
information will then be kept in the Past Footprint table as shown
in Figure 2.

Figure 3: Visualization of the accumulator table used to track
program behavior for gzip. The x-axis is in billions of instructions,
while the y-axis is the entry of the accumulator table. Each
point on the graph corresponds to the value of the accumulator
table at the end of a profiling interval where dark values correspond
to more heavily accessed entries. The same trends that
were seen in Figure 1 can be clearly seen in Figure 3.

The number of counter value bits that we need to observe is 
related to Nbuckets. As we increase the number of buckets, the 
data is spread over more buckets (table entries), making for fewer
entries per bucket (better resolution) but at the cost of more area
(both in terms of number of buckets and more bits per bucket). 
To be on the safe side, we would like any distribution of data into 
buckets to provide useful information. To achieve this we need 
to ensure that even if data is distributed perfectly evenly over 
all of the buckets, we would still record information about the 
frequency of those buckets. This can be achieved by reducing 
the accumulator counter by: 

(bucket[i] x Nbuckets) / (interval size)

If the number of buckets and interval size are powers of two, 
this is a simple shift operation. For the number of buckets we 
have chosen (32), and the interval size we profile over, this re- 
duces the bucket size to 6 bits, and thus requires 24 bytes of stor- 
age for each unique phase in the Past Footprint table. In practice 
we see that the top 6 bits of the counter are more than enough 
to distinguish between two phases. In the worst case, you may 
need one or two more bits to reduce quantization error, but in 
reality we have not seen any programs that cause this to be an 
issue. 
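
The compression step and the storage figure above can be sketched as
follows. This is a minimal illustration, not the hardware design: the
24-bit counter width and the function names are assumptions made here;
only the 32 buckets and the 6 retained bits come from the text.

```python
N_BUCKETS = 32      # number of buckets chosen in the text
BITS_KEPT = 6       # bits retained per bucket, per the text
COUNTER_BITS = 24   # assumption: width of each accumulator counter


def compress_counters(counters, counter_bits=COUNTER_BITS, bits_kept=BITS_KEPT):
    """Keep only the top `bits_kept` bits of each accumulator counter."""
    shift = counter_bits - bits_kept
    max_value = (1 << bits_kept) - 1
    # With power-of-two sizes the reduction is a plain right shift;
    # values are clamped so they always fit in `bits_kept` bits.
    return [min(c >> shift, max_value) for c in counters]


def footprint_bytes(n_buckets=N_BUCKETS, bits_kept=BITS_KEPT):
    """Storage needed for one compressed footprint, in bytes."""
    return n_buckets * bits_kept // 8
```

With the defaults, footprint_bytes() evaluates 32 x 6 / 8, which is
the 24 bytes per phase quoted above.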

If too few buckets are used, aliasing effects can occur due 
to the hashing function, where two different phases will appear 
to have very similar Footprints. Therefore, we want to use a 
sufficiently large number of buckets to uniquely identify the dif- 
ferences in code execution between phases, while at the same 
time use only a small amount of area. 

To examine the aliasing effect and determine what the appro- 
priate number of buckets should be, Figure 4 shows the sum of 
the differences in the bucket weights found between all sequen- 
tial intervals of execution. The y-axis shows the sum total of 
differences for each program. This is calculated by summing the 




Figure 4: The percent difference found between Footprints from
sequential intervals of execution, when varying the number of
counters used to represent the footprints (x-axis: Number of
Counters). The results are normalized to the difference between
intervals found when having an infinite number of buckets to
represent the footprint; 32 buckets captures most of the benefit.

differences between the buckets captured for interval i and i - 1
for each interval i in the program. The x-axis is the number of
distinct buckets used. All of the results are compared to the ideal
case of using an infinite number of buckets (or one for each
separate basic block) to create the Footprint. On the program gcc,
for example, the total sum of differences with 32 buckets was
72% of that captured with an infinite number of buckets. In
general, we have found that 32 buckets was enough to distinguish
between two phases.
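
The metric plotted in Figure 4 can be sketched in a few lines. This is
an illustrative reconstruction under stated assumptions: basic blocks
are identified by integers, the hash is a simple modulo, and the
function names are hypothetical.

```python
def bucketize(block_counts, n_buckets):
    """Fold per-basic-block instruction counts into n_buckets counters.

    Assumption: block IDs are integers and the hash is a plain modulo.
    """
    buckets = [0] * n_buckets
    for block_id, count in block_counts.items():
        buckets[block_id % n_buckets] += count
    return buckets


def total_sequential_difference(intervals, n_buckets):
    """Sum the bucket differences between interval i and i - 1, over all i."""
    total = 0
    prev = None
    for block_counts in intervals:
        cur = bucketize(block_counts, n_buckets)
        if prev is not None:
            total += sum(abs(a - b) for a, b in zip(cur, prev))
        prev = cur
    return total
```

With too few buckets, two intervals that execute different blocks can
hash to the same counters and show no difference at all, which is the
aliasing effect described above.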

4.5 Classifying a Footprint to a Phase ID

After reducing the vector to form a footprint, we begin the clas- 
sification process by comparing the footprint to a set of repre- 
sentative past footprint vectors. We compare the current vector 
to each vector in the table. The next section details how we per- 
form the comparison and determine what a match is. If there is 
a match, we classify the profiled section of execution into the 
same phase as the past footprint vector, and the current vector 
is not inserted into the past footprint table. If there is no match, 
then we have just detected a new phase and hence must create a 
new unique phase ID into which we may classify it. This is done 
by choosing a unique phase ID out of a fixed pool of IDs. When 
allocating a new phase ID, we also allocate a new past footprint 
entry, set it to the current vector, and store with that entry the 
newly allocated phase ID. This allows future similar phases to
be classified with the same ID. In this way only a single vector 
is kept for each unique phase ID, to serve as a representative of 
that phase. After a phase ID is provided for the most recent
interval, it is passed along to prediction and statistic logging, and
the phase identification part of our algorithm is completed. 
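
The classification loop described above can be sketched as follows.
This is a software illustration only; the class name, the first-match
policy, the strict "less than" comparison, and the behavior when the ID
pool runs out are all assumptions, since the text does not fix them.

```python
class PhaseClassifier:
    """Keeps one representative footprint per phase ID."""

    def __init__(self, threshold, max_phases=32):
        self.threshold = threshold
        self.max_phases = max_phases  # assumption: size of the fixed ID pool
        self.table = []               # past footprint table: (footprint, phase ID)

    def classify(self, footprint):
        for past, phase_id in self.table:
            distance = sum(abs(a - b) for a, b in zip(footprint, past))
            if distance < self.threshold:
                # Match: reuse the ID; the current vector is not inserted.
                return phase_id
        if len(self.table) >= self.max_phases:
            # Assumption: the text does not say what happens here.
            raise RuntimeError("phase ID pool exhausted")
        new_id = len(self.table)
        self.table.append((list(footprint), new_id))
        return new_id
```

Only the first footprint seen for a phase is stored, so the table grows
by one entry per new phase rather than one entry per interval.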

To examine the number of phase IDs we need to track, Fig- 
ure 5 shows the percentage of execution that can be accounted 
for by the top p phases, where p is shown on the x-axis. Re- 
sults are graphed for the programs that had the min (galgel) 
and max (art) coverage, gcc, gzip, and the overall average.
These results show that most of the program's phase behavior 
can be captured using a relatively small number of phase IDs. 



Figure 5: Results of the minimum number of phases that need
to be captured versus the amount of program execution they cover.
The y-axis is the percent of program execution that is covered.
The x-axis (Number of Hardware Detected Phases) is the minimum
number of phases needed to capture that much program execution.

If we only track and optimize for the top 20 phases in each ap- 
plication, we will capture and be able to accurately apply phase 
prediction/optimizations to over 90% of the program's execution 
on average. In the worst case (min), we are able to optimize most 
of the program (over 80%) by only targeting a small number (20)
of important recurring phases. 
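
The coverage curve of Figure 5 can be computed from a trace of
per-interval phase IDs. The sketch below assumes equal-length intervals
(so each interval carries equal weight); the function name is
hypothetical.

```python
from collections import Counter


def coverage_of_top_phases(phase_ids, p):
    """Fraction of intervals accounted for by the p most frequent phases."""
    counts = Counter(phase_ids)
    covered = sum(n for _, n in counts.most_common(p))
    return covered / len(phase_ids)
```

Calling this with p = 20 on a program's trace gives the quantity
discussed above for the top 20 phases.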

4.5.1 Finding a Match

We search through the Footprint histories to find a match, but 
this query is complicated by the fact that we are not necessar- 
ily searching for an exact match. Two sections of execution that 
have very similar footprints could easily be considered a match, 
even if they do not compare exactly. To compare two vectors 
to one another, we use the Manhattan distance between the two, 
which is the element-wise sum of the absolute differences. This 
distance is used to determine if the current interval should be 
classified as the same phase ID as one of the past footprint inter- 
vals. 
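
For concreteness, the Manhattan distance between two footprint vectors
can be written as below (the function name is ours):

```python
def manhattan_distance(a, b):
    """Element-wise sum of absolute differences between two vectors."""
    if len(a) != len(b):
        raise ValueError("footprints must have the same number of buckets")
    return sum(abs(x - y) for x, y in zip(a, b))
```

Two footprints are then treated as a match when this distance falls
under the chosen threshold.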

If we set the distance threshold too low, the phase detection 
will be overly sensitive, and we will classify the program into 
many, very tiny phases which will cause us to lose any bene- 
fit from doing run-time phase analysis in the first place. If the 
threshold is too high, the classifier will not be able to distinguish 
between phases with different behavior. To quantify this effect, 
we examine how well our hardware technique classifies phases
for a variety of thresholds compared to the phases found by the 
off-line clustering algorithm used in SimPoint [21]. 

The SimPoint tool is able to make global decisions to opti- 
mize the grouping of similar intervals into phases. The off-line 
algorithm makes no use of thresholds; instead, its decisions are
based solely on the structure found in the distribution of pro- 
gram behaviors. Our technique must be far more simplistic be- 
cause it must be performed on-line and with limited computa- 
tional overhead. This reduction in complexity comes at the cost 
of increased error. 

The Different Phases line in Figure 6 shows the ability of 
our hardware technique to find phase changes (transitions be- 
tween one phase and the next) when different thresholds are used 




Figure 6: Results showing how well our hardware phase tracker
classifies two sequential intervals of execution as being from
"Different" or the "Same" phase of execution (x-axis: Lg Distance
Threshold, 12 to 24). The percent of misclassifications are shown
in comparison to the phase classifications found using the
off-line clustering SimPoint tool [21].

to perform the phase classification. For example, when using a 
Manhattan distance of 1 million as our threshold (shown as 20 
on our x-axis because it is in log2), our hardware technique
identified 80% of the phase changes that occurred in the more
complex off-line SimPoint analysis. Conversely, 20% of the phase
changes were incorrectly classified as having the same phase ID 
as the last interval of execution. 

Likewise, the Same Phases line in Figure 6 represents the 
ability of our hardware technique to accurately classify two se- 
quential intervals as being part of the same phase as a function 
of different thresholds (again as compared to the off-line cluster- 
ing analysis). For example, when using a Manhattan distance of 
1 million (shown as 20 on the x-axis), our hardware technique 
identified 80% of the intervals that stayed in the same phase 
as correctly staying in the same phase, but 20% of those inter- 
vals were classified as having a different phase ID from the prior 
phase. 

A misclassification occurs when two sequential intervals of 
execution are classified as being in the same phase or in different 
phases using our hardware approach when the off-line clustering 
analysis tool found the opposite for these two intervals. 
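
The two error rates plotted in Figure 6 can be computed from a pair of
phase ID traces, one from the hardware tracker and one from the
off-line baseline. This is a sketch of the definition given above; the
function name and the choice to return 0.0 when a category is empty
are assumptions.

```python
def misclassification_rates(hw_ids, offline_ids):
    """Return (same_phase_error, different_phase_error) as fractions.

    Each pair of sequential intervals is labelled "same" or "different"
    by the off-line baseline; the hardware result is wrong when it says
    the opposite.
    """
    same_total = same_wrong = diff_total = diff_wrong = 0
    pairs = zip(zip(hw_ids, hw_ids[1:]), zip(offline_ids, offline_ids[1:]))
    for (hw_prev, hw_cur), (off_prev, off_cur) in pairs:
        hw_same = hw_prev == hw_cur
        if off_prev == off_cur:
            same_total += 1
            same_wrong += not hw_same
        else:
            diff_total += 1
            diff_wrong += hw_same
    same_rate = same_wrong / same_total if same_total else 0.0
    diff_rate = diff_wrong / diff_total if diff_total else 0.0
    return same_rate, diff_rate
```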

If we are too aggressive and our hardware phase analysis in- 
dicates that there are phase changes when there are actually no 
noticeable changes in behavior, then we will create too many 
phase IDs that have similar behavior. This can create more over- 
head for performing phase-based optimization. On the other 
hand, if we are too passive in distinguishing between different 
phases, we will be missing opportunities to make phase specific 
optimizations. 

In order to strike a balance between having a high capture 
rate and reducing the percent of false positives, we chose to use 
a threshold of 1 million. When comparing this with the interval 
size of 10 million instructions, this means that a difference in the 
phase behavior will be detected if 10% of the executed instruc- 
tions are in different proportions. In choosing 1 million, we have 
on average a 20% misclassification rate. Note that a
misclassification does not necessarily mean that an incorrect optimization
will be performed. For example, if we have a "Same Phase" mis- 
classification (the two intervals were really from the same phase, 
but were classified into different phases), then a phase change is 
observed using our hardware technique when there was not one 
in the baseline classifier. If the two hardware detected phases 
have the same optimization applied to them, then this misclassi- 
fication can have no effect. 
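
The threshold figures quoted in this section can be checked with a
little arithmetic. The snippet below only restates numbers from the
text (a threshold of 1 million and an interval of 10 million
instructions); the variable names are ours.

```python
import math

THRESHOLD = 1_000_000            # Manhattan distance threshold from the text
INTERVAL_SIZE = 10_000_000       # instructions per profiling interval

# Fraction of an interval that must shift between buckets to exceed
# the threshold: 1 million out of 10 million, i.e. 10%.
fraction = THRESHOLD / INTERVAL_SIZE

# Position of the threshold on a log2 axis: just under 20, which is
# why 1 million appears at 20 on the x-axis of Figure 6.
log2_threshold = math.log2(THRESHOLD)
```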

4.6 Per-Phase Performance Metric Homogeneity

Using the techniques presented above, we can perform phase 
classification on programs at run-time with little to no impact 
on the design of the processor core. One of the goals of phase 
classification is to divide the program into a set of phases that are 
fairly homogeneous. This means that an optimization adapted 
and applied to a single segment of execution from one phase, 
will apply equally well to the other parts of the phase. In order to 
quantify the extent to which we have achieved this goal, we need 
to test the homogeneity of a variety of architectural statistics on 
a per-phase basis. 

Figure 7 shows the results of performing this analysis on the 
phases determined at run-time. Due to space constraints we only 
show results for two of the more complicated programs gcc and 
gzip. For both programs, a set of statistics for each phase is 
shown. The first phase that is listed (separated from the rest) as 
full, is the result of classifying the entire program into a single 
phase. The results show that for gcc for example, the average 
IPC of the entire program was 1.32, while the average number 
of cache misses was 445,083 per ten million instructions. In 
addition to just the average value, we also show the standard 
deviation for that statistic. For example, while the average IPC 
was 1.32 for gcc, it varied with a standard deviation of over 
43% from interval to interval. If the phase-tracking hardware is 
successful in classifying the phases, the standard deviations for 
the various metrics should be low for a given phase ID. 
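
The homogeneity test can be sketched as a per-phase mean and relative
standard deviation. This is an illustration under assumptions: the
standard deviation is taken as the population form and expressed as a
percentage of the mean (consistent with the "1.6% (of 0.61)" wording
used for Figure 7), and the function name is hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def per_phase_stats(phase_ids, metric):
    """Map each phase ID to (mean, standard deviation as % of the mean).

    `phase_ids[i]` is the phase of interval i and `metric[i]` is a
    statistic (for example IPC) measured over that interval.
    """
    groups = defaultdict(list)
    for phase_id, value in zip(phase_ids, metric):
        groups[phase_id].append(value)
    stats = {}
    for phase_id, values in groups.items():
        avg = mean(values)
        rel = pstdev(values) / avg * 100 if avg else 0.0
        stats[phase_id] = (avg, rel)
    return stats
```

Passing the same phase ID for every interval gives the whole-program
("full") row; a successful classification shows much smaller
percentages for the individual phases.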

Underneath the phase marked full are the five most fre- 
quently executed phases from the program as identified by our 
phase tracker. The phases are weighted by the percentage of the 
program's executed instructions they account for. For gcc, the 
largest phase accounts for 18.5% of the instructions in the entire 
program and has an average IPC of 0.61 and a standard devi- 
ation of only 1.6% (of 0.61). The other top four phases have 
standard deviations at or below this level, which means that our 
technique was successful at dividing up the execution of gcc 
into large phases with similar execution behavior with respect to 
IPC. Note that some metrics for certain phases have a high
standard deviation, but this occurs for architecture features/metrics
that are unimportant for that phase. For example, the phase that
occurs for 7.2% of execution in gcc has only 75 L1 instruction
cache misses on average. This is an L1 miss rate of 0.00075%,
so an error of 215% for this metric will not likely have any effect 
on the phase. 
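
The miss rate quoted above follows directly from the interval size.
The check below assumes the rate is expressed as misses per executed
instruction within one 10 million instruction interval.

```python
MISSES_PER_INTERVAL = 75               # average L1 instruction cache misses
INSTRUCTIONS_PER_INTERVAL = 10_000_000

# 75 / 10,000,000 = 0.0000075, i.e. 0.00075 when written as a percentage.
miss_rate_percent = MISSES_PER_INTERVAL / INSTRUCTIONS_PER_INTERVAL * 100
```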

When we look at the energy consumption of gcc, it can be 
observed that energy consumption swings radically (a standard 
deviation of 90%) over the complete execution of the program. 
This can be seen visually in Figure 1, which plots the energy 
usage versus instructions executed. However, after dividing the 
program into phases, we see that each phase has very little
variation within itself. All have less than 2% standard deviation.
By analyzing gcc it can also be seen that the phase partitioning 
does a very good job across all of the measured statistics even 
though only one metric is used. This indicates that the phases 
that we have chosen are in some way representative of the actual 
behavior of the program. 

5 Phase Prediction 

The prior section described our phase tracking architecture, and 
how it can be used to classify phases. In this section we focus on 
using phase information to predict the next phase. For a variety 
of applications it is important to be able to predict future phase 
changes so that the system can configure for the code it will soon 
be executing rather than simply reacting to a change in behavior. 

Figure 8 shows the percentage of interval transitions that are
changes in phase, for our set of benchmarks. For all of these pro- 
grams, phase changes come quite often, but it should be noted 
that this statistic alone cannot gauge the complexity of the pro- 
gram behavior. The program gcc switches less than 10% of 
the time but switches between many different phases. The other 
extreme is art which switches almost half the time, but it is 
only switching between a few distinct phases. In this case, large 
repeating patterns can be observed. No two phases executing se- 
quentially are that similar, but there is an order to the sequence. 
By adding in a prediction scheme for these cases, we not only 
take advantage of stable conditions as in past research, but actu- 
ally take advantage of any repeating patterns in program behav- 
ior. 

5.1 Markov Predictor 

The prediction of phase behavior is different from many other 
systems in which hardware predictors are used. Because of this 
new environment, a new type of predictor has the potential to 
perform better than simply using predictors from other areas of 
computer architecture (branch and address prediction, memory 
disambiguation, etc.). 

After observing the way that phases change, we determined 
that two pieces of information are important. First, the set of 
phases leading up to the prediction are very important, and sec- 
ond, the duration of execution of those phases is important. 

A classic prediction model that is easily implementable in 
hardware is a Markov Model. Markov Models have been used
in computer architecture to predict both prefetch addresses [13] 
and branches [8] in the past. The basic idea behind a Markov 
Model is that the next state of the system is related to the last set 
of states. 

The intuition behind this design is that phase information 
tends to be characterized by many sections of stable behavior 
interspersed with abrupt phase changes. The key is to be able to 
predict when these phase changes will occur, and to know ahead 
of time what phase they will change to. The problem is that the 
changes are often preceded by stable conditions, and if we only 
consider the last couple of intervals we will not be able to tell 
the difference between sections of stable behavior that precede 
a phase change, and those sections that will continue to be sta- 
ble. Instead, we need a way of compressing down stable phase 



Figure 7: Examination of per-phase homogeneity compared to the
program as a whole (denoted by full). For the two programs
and each of the top 5 phases of each program, we show the average
value of each metric and the standard deviation. The name
of the phase is the percent of execution that it accounts for in
terms of instructions. These results show that after dividing up the
program into phases using our run-time scheme the behavior within
each phase is quite consistent.



Figure 8: The percent of execution intervals that transition to
a different phase from the prior execution interval's phase as
found by our phase tracking architecture with 32 footprint
counters using a 1 million Manhattan threshold.

information into a piece of information that we can use as state. 

5.2 Run Length Encoding Markov Predictor 

To compress the stable state we use a Run Length Encoding 
(RLE) Markov predictor. The basic idea behind the predictor 
is that it uses a run-length encoded version of the history to in- 
dex into a prediction table. The index into the prediction table is 
a hash of the phase identifier and the number of times the phase 
identifier has occurred in a row. 

Figure 9 shows our RLE Markov Phase ID prediction architecture.
The lower order bits of the hash function provide an
index into the prediction table, and the higher order bits of the 
hash function provide a tag. When there is a tag match, the phase 
ID stored in the table provides a prediction as to the next phase 
to occur in execution. When there is a tag miss, the prior phase 
ID is assumed to be the next phase ID to occur in the program's
execution. We found predicting the last phase ID to be 75%
accurate on average.

We only update the predictor when there is (1) a change in 
the phase ID, or (2) when there is a tag match. We only insert an 
entry when there is a phase ID change, since we want to predict 



Figure 9: Phase Prediction Architecture for the Run Length En- 
coded (RLE) Markov predictor. The basic idea behind the pre- 
dictor is that two pieces of information are used to generate the 
prediction, the phase id that was just seen, and the number of 
times prior to now that it has been seen in a row. The index into 
the prediction table is a hash of these two numbers. 

when the phase is going to change. Execution intervals where 
the same phase ID occurs several times in a row do not need 
to be stored in the table, since they will be correctly predicted
as "last phase ID" when there is a table miss. This helps
table capacity constraints and avoids polluting the table with last 
phase predictions. For the second update case, when there is 
a tag match, we update the predictor because the observed run 
length may have potentially changed. 
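
The lookup and update rules above can be sketched in software as
follows. This is a behavioral illustration, not the hardware: the
bit-packing hash, the 5-bit phase ID assumption, and all names are
choices made here; the text specifies only that the index and tag come
from a hash of the phase ID and its run length.

```python
class RLEMarkovPredictor:
    """Sketch of a run-length-encoded Markov phase predictor."""

    PHASE_BITS = 5  # assumption: phase IDs fit in 5 bits (at most 32 IDs)

    def __init__(self, n_entries=256):
        self.n_entries = n_entries
        self.table = [None] * n_entries  # entry: (tag, predicted next phase ID)
        self.current_id = None
        self.run_length = 0

    def _index_and_tag(self):
        # Illustrative hash: pack run length and phase ID into one integer,
        # then use the low-order part as index and the rest as tag.
        h = (self.run_length << self.PHASE_BITS) | self.current_id
        return h % self.n_entries, h // self.n_entries

    def predict(self):
        if self.current_id is None:
            return None
        index, tag = self._index_and_tag()
        entry = self.table[index]
        if entry is not None and entry[0] == tag:
            return entry[1]          # tag match: use the stored phase ID
        return self.current_id       # tag miss: predict "last phase ID"

    def update(self, actual_id):
        if self.current_id is not None:
            index, tag = self._index_and_tag()
            entry = self.table[index]
            tag_match = entry is not None and entry[0] == tag
            # Insert on a phase change; refresh an entry on a tag match.
            if actual_id != self.current_id or tag_match:
                self.table[index] = (tag, actual_id)
        if actual_id == self.current_id:
            self.run_length += 1
        else:
            self.current_id, self.run_length = actual_id, 1
```

After the pattern "three intervals of phase 0, then phase 1" has been
seen once, the entry keyed by (phase 0, run length 3) predicts phase 1
the next time that run recurs, while shorter runs still fall back to
the last phase ID.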

5.3 Predictor Comparison 

We compare our RLE Markov phase predictor with other pre- 
diction schemes in Figure 10. This Figure has four bars for ev- 
ery program, and each bar corresponds to the prediction accu- 
racy of a prediction architecture. The first and simplest scheme, 
Last Phase, simply predicts that the next phase is the same as 
the current phase, in essence always predicting stable operation. 
The prediction accuracy of this scheme is inversely proportional 
to the rate at which phases change in a given benchmark. For 
the program gzip for example, there are long periods of execu- 
tion where the phase does not change, and therefore predicting 
no-change does exceptionally well. 

In order to ensure that we were not simply providing an



Figure 10: Phase ID Prediction Accuracy. This figure shows
how well different prediction schemes work. The most naive
scheme, last, simply predicts that the phases never change.
The bars marked Markov and RLE Markov show how well
we can predict the phase identifiers if we use a Markov
prediction scheme with a Markov table size of 256 entries.

expensive filter for noise in the phase IDs, we also compared 
against a simple noise filter which works by predicting that the 
next phase will be the most commonly occurring of the last three 
phases seen. This is not shown, as it actually performed worse 
on all of the programs. 
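
The two simplest schemes, Last Phase and the most-common-of-three
noise filter, can be sketched as below together with a small accuracy
harness. The tie-breaking rule (most recent phase wins) and the
function names are assumptions; the text does not say how ties among
the last three phases are resolved.

```python
def last_phase(history):
    """Predict that the next phase equals the most recent one."""
    return history[-1]


def majority_of_last_three(history):
    """Predict the most common of the last three phases seen.

    Assumption: on a tie, the most recently seen phase wins.
    """
    window = history[-3:]
    return max(reversed(window), key=window.count)


def accuracy(phase_ids, predict):
    """Fraction of intervals (after the first) that `predict` gets right."""
    hits = sum(
        predict(phase_ids[:i]) == phase_ids[i]
        for i in range(1, len(phase_ids))
    )
    return hits / (len(phase_ids) - 1)
```

For example, accuracy(trace, last_phase) gives the Last Phase accuracy
for a list of per-interval phase IDs named trace.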

Additionally we wanted to examine the effect of a simple 
Markov model predictor for history lengths of 1 and 2. The 
Markov model predictor does a better job of predicting phase 
transitions than Last Phase, but it is limited by the fact that
long runs will always be predicted as infinitely stable due to the
history filling up. However, it is still very effective for facerec
and applu, but does not provide much benefit for either art or 
galgel. 
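
A plain Markov predictor with a fixed history length can be sketched
as follows, which also shows the limitation just described. The
unbounded dictionary stands in for the 256-entry table and is an
assumption of this sketch, as are the names.

```python
class MarkovPredictor:
    """Predicts the next phase ID from the last `history_len` phase IDs."""

    def __init__(self, history_len=1):
        self.history_len = history_len
        self.history = []
        self.table = {}  # assumption: unbounded, unlike a hardware table

    def predict(self):
        key = tuple(self.history)
        if key in self.table:
            return self.table[key]
        # No entry yet: fall back to the last phase seen, if any.
        return self.history[-1] if self.history else None

    def update(self, actual_id):
        if len(self.history) == self.history_len:
            self.table[tuple(self.history)] = actual_id
        self.history = (self.history + [actual_id])[-self.history_len:]
```

Once a run is longer than the history length, every position inside
the run produces the same key, so the predictor cannot tell how far
into the run it is. Run-length encoding the history removes exactly
this ambiguity.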

The final bar, RLE Markov, is our improved Markov pre- 
dictor which compresses stable phases into a tuple of phase 
id and duration. All of the Markov predictors simulated had 
256 entries taking up less than 500 bytes of storage. Using 
RLE Markov outperforms both the Last Phase and tradi- 
tional Markov on all the benchmarks. It performs especially 
well compared to other schemes on both applu and art. Over- 
all, using a Run-Length Encoded Markov predictor can cut the 
phase mispredictions down to 14% on average. 

6 Applications 

This section examines three optimization areas in which a phase- 
aware architecture can provide an advantage. We begin by ex- 
amining the relationship between phase behavior and value lo- 
cality. We then demonstrate ways to reduce processor energy 
consumption by adjusting the aggressiveness of the data cache 
and the instruction front end. 

6.1 Frequent Value Locality 

Prior work on value predictors has shown that there is a great 
deal of value locality in a variety of programs [14, 7]. Recently, 
researchers have started to take advantage of frequently loaded 



values for the purpose of optimizing caches. For example, Yang 
and Gupta [22] proposed a data cache organization that com- 
presses the most frequently used program values in order to save 
energy. Another way of exploiting value locality is through value 
specialization, which can be done either statically or dynami- 
cally [6, 17, 16] to create specialized versions of procedures or 
code-regions based upon the values frequently seen. These
techniques are built on the idea of finding the most frequent values
for loads over the whole program, and then specializing the pro- 
gram to those frequent values. 

We examine the potential of capturing frequent values on a 
per-phase basis and compare this to the frequent values aggre- 
gated over the entire program, as would be used in value code 
specialization [6]. To perform this experiment we first gathered
the top 16 values that were loaded over the complete execution
of the program and stored them into a table. We then examined
the percentage of executed loads that found their loaded value in
this table. This result is shown as Static in Figure 11. While
significant portions of some programs are covered by just these 
few top values (such as applu), over half of the programs have 
less than 10% of their loaded values covered by these top values. 

The question is: can we do better by exploiting hardware- 
detected phase information? To answer this question we take the 
top 16 values for each phase, as detected by the hardware phase 
tracker. These top values will be shared across a single phase 
even if it is split into two or more different sections of execution. 
Each load in the program is then checked against the top val- 
ues for its corresponding phase. The Phase Coverage bar 
in Figure 11 shows the percent of all load values in the program 
that were successfully matched to its per-phase top value set.
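
Both coverage measurements can be sketched over a trace of loads. The
trace format, a list of (phase ID, loaded value) pairs, and the
function names are assumptions made for this illustration.

```python
from collections import Counter, defaultdict


def static_coverage(loads, k=16):
    """Fraction of loads whose value is among the top k values overall."""
    values = [value for _, value in loads]
    top = {value for value, _ in Counter(values).most_common(k)}
    return sum(value in top for value in values) / len(values)


def phase_coverage(loads, k=16):
    """Fraction of loads whose value is among the top k for their phase."""
    per_phase = defaultdict(Counter)
    for phase_id, value in loads:
        per_phase[phase_id][value] += 1
    top = {
        phase_id: {value for value, _ in counts.most_common(k)}
        for phase_id, counts in per_phase.items()
    }
    return sum(value in top[phase_id] for phase_id, value in loads) / len(loads)
```

Because each phase gets its own table, a value that dominates one phase
is captured even if it is rare over the program as a whole.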

Without any notion of loads or values, our method of divid- 
ing up phases is very successful at assisting in the search for fre- 
quent values. By just tracking the top 16 values of each phase, 
we are able to capture the values from almost 50% of the
executed loads on average. The Perfect bar shows the percentage of
loads covered if one captures the top 16 load values for each and 
every interval (i.e., 10 million instructions) separately. This is in 
effect the best that we could hope to achieve for an interval size 
of 10 million instructions, because the 16 entries in the value ta- 
ble are custom crafted for each interval individually. As shown 
in Figure 11, the phase-tracker compares favorably with the op- 
timal coverage. Two thirds of the total possible benefit from 
per-interval value locality can be captured by per-phase value 
locality. It is important to point out this graph by itself is not 
a good indicator of usefulness as near perfect coverage could 
be achieved simply by making every interval a separate phase. 
However, as shown in Figure 5 only a few phases (around 20) 
are used to cover at least 80% of the program's execution. 

6.2 Dynamic Data Cache Size Adaptation 

In a modern processor a significant amount of energy is con- 
sumed by the data cache, but this energy may not be put to 
good use if an application is not accessing large amounts of data 
with high locality. To address this potential inefficiency, previ- 
ous work has examined the potential of dynamically reconfigur- 
ing the data caches with the intention of saving power. In [2], 
Balasubramonian et al. present two different schemes with



Figure 11: The percent of the program's load values that are
found in a table of the most frequently loaded values over the
whole program (Static Coverage), on a per-phase basis (Phase
Coverage), and on a per execution interval basis (Optimal
Coverage).

which re-configuration may be guided. In one scheme, hard- 
ware performance counters are read by re-configuration software 
every hundred thousand cycles. The software then makes a de- 
cision based on the values of the counters. In another scheme, 
re-configuration decisions are performed on procedure bound- 
aries instead of at fixed intervals. To reduce the overhead of re- 
configuration, software to trigger re-configuration is only placed 
before procedures that account for more than a certain percent- 
age of execution. 

Another form of re-configurable cache that has been pro- 
posed dynamically divides the data cache into multiple parti- 
tions, each of which can be used for a different function such 
as instruction reuse buffers, value predictors, etc. [18]. These
techniques can be triggered at different points in program exe- 
cution including procedure boundaries and fixed intervals. The 
overhead of re-configuration can be quite large and making these 
policy decisions only when the large scale program behavior 
changes, as indicated by phase shifts in our hardware tracker, 
can minimize overhead while guaranteeing adequate sensitivity 
to attain maximum benefit. 

We examined the use of phase tracking hardware to guide an 
energy aware, re-sizable cache. The energy consumption of the 
data cache can be reduced by dynamically shifting to a smaller, 
less associative cache configuration for program phases that do 
not benefit significantly from more aggressive cache configura- 
tions. By targeting only those phases that are predicted to have 
energy savings due to cache size reduction, our scheme is able 
to reduce power with very little impact on the performance. 

We examined an architecture with two possible cache con- 
figurations, 32KB 4-way associative and 8KB direct mapped. In 
Figure 12, the trade off between these two configurations is plot- 
ted. For each program, we use the 32KB cache configuration as 
the baseline result. The labeled circles in Figure 12 show the 
total processor energy savings and performance degradation for 
each program if only the smaller (8KB) cache size is used. For 
example, a processor with a smaller cache configuration for the 
program applu is both 5% slower and uses 5% less energy. 



Figure 12: Data Cache Re-configuration. The tradeoff between
energy savings and slowdown for two different cache policies.
All results are relative to a 32KB 4-way associative cache. The
circles in the graph (each labeled with a number for the program
the data point is from) show the energy and performance of an
8KB direct mapped cache. The triangles show the tradeoff of
intelligently switching between an 8KB direct mapped and a 32KB
4-way data cache based on phase classification and prediction.
Program key: 0 applu, 1 apsi, 2 art, 3 bzip2, 4 facerec, 5 galgel,
6 gcc, 7 gzip, 8 mcf, 9 vpr. Legend: circles, Small Cache;
triangles, Phase Aware.

Two programs, vpr and apsi, actually use more energy with a
smaller cache due to large slowdowns. These two points are off
the scale of this graph and are not shown. 

While examining energy savings and slow down is interest- 
ing, it is important to note that there is more than one way to 
reduce both energy and performance. Voltage scaling in particu- 
lar has proven to be a technology capable of reaping large energy 
savings for a relative reduction in performance. For our results, 
we assume that for voltage scaling a performance degradation of 
5% will yield an approximate energy saving of 15%. We use this 
rule of thumb as our guideline for determining when to reduce 
the active size of the cache. In Figure 12, this simple model of 
voltage scaling is plotted as a dashed line. When the cache size 
is reduced, most programs fall far short of this baseline, meaning 
that voltage scaling would provide a better performance-energy
tradeoff. There are a couple of exceptions; in particular mcf,
bzip, and gzip do well even without any sort of phase-based
re-configuration. 
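
The voltage scaling rule of thumb reduces to a single ratio test. The
sketch below encodes it under the stated assumption that a 5% slowdown
buys roughly 15% energy savings; the names are ours.

```python
# Rule of thumb from the text: 15% energy saved per 5% slowdown.
VOLTAGE_SCALING_RATIO = 15 / 5


def beats_voltage_scaling(energy_saving_pct, slowdown_pct):
    """True when a configuration saves more energy than voltage scaling
    would for the same slowdown (both arguments are percentages)."""
    return energy_saving_pct > VOLTAGE_SCALING_RATIO * slowdown_pct
```

For the applu data point given earlier (5% slower, 5% less energy with
the small cache), the test fails: voltage scaling would be expected to
save about 15% for the same slowdown.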

The shaded triangles in Figure 12 show what happens if 
we use phase classification and prediction to guide our re- 
configuration. When a new phase ID is seen, we sample the IPC 
and energy used for a few intervals using the 32KB 4-way cache, 
and a few intervals for the 8KB direct mapped cache. These sam- 
ples could be kept in a small hardware profiling table associated 
with the phase ID. After taking these samples, if we find that a 
particular phase is able to achieve more than three times the en- 
ergy savings relative to the slow down seen when using the 8KB 
cache, we then predict for this phase ID that the smaller cache
size should be used. This heuristic means that the small cache 
size is used only if re-configuration would beat voltage scaling 
for that phase. After a decision has been made as to the con- 



Figure 13: Processor Width Adaptation. The tradeoff between
energy savings and slowdown (x-axis) for two different front end
policies. All results are relative to an aggressive 8-issue machine.
The circles in the graph (each labeled with a number for the
program) show the energy and performance of a less aggressive
2-issue processor. The triangles show using the phase classifier
and predictor for switching between 2-issue and 8-issue based
on phase changes.
Program key: 0 applu, 1 apsi, 2 art, 3 bzip2, 4 facerec, 5 galgel,
6 gcc, 7 gzip, 8 mcf, 9 vpr. Legend: circles, Low Issue;
triangles, Phase Aware.

figuration to use for a phase ID, the corresponding cache size is 
stored in the phase profiling table/database associated with that 
phase ID. The phase classifier and predictor are then used to pre- 
dict when a phase change occurs. When a phase change predic- 
tion occurs, the predicted phase ID looks up the cache size in the 
profiling table, and re-configures the cache (if it is not already 
that size) at the predicted phase change. 
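
The sampling decision and the profiling table lookup can be sketched
as follows. This is a software illustration under assumptions: slowdown
is derived from sampled IPC, unprofiled phases default to the large
cache, and every name here is hypothetical.

```python
LARGE = "32KB 4-way"
SMALL = "8KB direct mapped"


def choose_cache(ipc_large, energy_large, ipc_small, energy_small, ratio=3.0):
    """Pick the small cache only if its energy saving is more than
    `ratio` times the slowdown it causes (both as fractions)."""
    slowdown = (ipc_large - ipc_small) / ipc_large
    saving = (energy_large - energy_small) / energy_large
    if saving > 0 and saving > ratio * slowdown:
        return SMALL
    return LARGE


class CacheController:
    """Phase profiling table plus the currently active configuration."""

    def __init__(self):
        self.profile = {}      # phase ID -> chosen cache configuration
        self.current = LARGE

    def record(self, phase_id, config):
        self.profile[phase_id] = config

    def on_predicted_phase(self, phase_id):
        """Apply the stored configuration; return True if it changed."""
        wanted = self.profile.get(phase_id, LARGE)  # assumption: default
        changed = wanted != self.current
        self.current = wanted
        return changed
```

Because the lookup is keyed by the predicted phase ID, the cache is
re-configured only at predicted phase changes, and only when the stored
size differs from the current one.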

For all programs, our re-configuration is able to beat 
or tie voltage scaling. For example, using phase-based
re-configuration results in a slowdown of 0.5% for applu, while
the total energy savings is 4.5%. Even the program apsi, which 
had increased energy consumption in the small cache configura- 
tion, is able to get almost 5% energy savings with only a 1% 
slowdown. 

6.3 Dynamic Processor Width Adaptation 

One way to reduce the energy consumption in a processor is to 
reduce the number of instructions entering the pipeline every cy- 
cle [12, 1]. We call this adjusting the width of the processor. 
Reducing the width of the processor reduces the demand on the 
fetch, decode, functional units, and issue logic. Certain phases 
can have a high degree of instruction level parallelism, whereas 
other phases have a very low degree. Take for example the top 
two phases for gcc shown in Figure 7. The intervals classified 
to be in the first phase consisting of 18.5% of execution have an 
IPC of 0.61 with a high data cache miss rate. In comparison, 
the intervals in the second most frequently encountered phase 
(accounting for 18.1% of execution) have an IPC of 1.95 and 
very low data cache miss rates. We can potentially save energy 
without hurting performance by throttling back the width of the 
processor for phases that have low IPC, while still using aggres- 
sive widths for phases with high IPC. 



In the current literature, decisions to reduce or increase the 
fetch/decode/issue bandwidth of the processor are made either 
at fixed intervals (relatively short intervals such as 1,000 cy- 
cles) [12] or, as in the case of branch confidence based schemes, 
when a branch instruction is fetched [1]. It can be very difficult to 
design real systems that save energy by reconfiguring at these 
speeds, but a hardware phase-tracker can help make these deci- 
sions at a coarser granularity while still maintaining performance 
and energy benefits. 

We examined an architecture that could be configured with 2 
different widths - one where up to 2 instructions are decoded and 
up to 2 issued per cycle, and one where up to 8 instructions are 
decoded and up to 8 issued per cycle. When a new phase ID is 
seen by the phase tracker, we sample the IPC for three intervals 
with a width of 2 instructions, and three intervals with a width 
of 8 instructions. If there is little difference in the IPC between 
these two widths, then we assign a width of 2 instructions to this 
Phase ID in our profiling table, otherwise we assign a width of 
8 instructions. During execution, we use the phase ID predictor 
to effectively predict the width for the next interval of execution 
and adjust the processor's width accordingly. Our results show 
that the chosen configuration for a given phase can be trained 
(1) with only a few samples, and (2) only once to accurately 
represent the behavior of a given phase ID. This requires very 
little training time due to the fact that 20 or fewer phase IDs 
are needed to capture 80% or more of a program's execution as 
shown in Figure 5. 
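
The width-selection policy in the preceding paragraph could be sketched roughly as below. The text does not say how small an IPC difference counts as "little", so the `tolerance` parameter (5% here) is a placeholder assumption, as are the names `choose_width` and `WidthTable`.

```python
from typing import Dict, Sequence


def choose_width(ipc_narrow: Sequence[float], ipc_wide: Sequence[float],
                 tolerance: float = 0.05) -> int:
    """Return 2 if the 2-wide IPC is within `tolerance` of the 8-wide IPC, else 8."""
    narrow = sum(ipc_narrow) / len(ipc_narrow)
    wide = sum(ipc_wide) / len(ipc_wide)
    return 2 if (wide - narrow) <= tolerance * wide else 8


class WidthTable:
    """Hypothetical profiling table mapping a phase ID to an issue width."""

    def __init__(self) -> None:
        self._width: Dict[int, int] = {}

    def train(self, phase_id: int, ipc_narrow: Sequence[float],
              ipc_wide: Sequence[float]) -> None:
        # Each phase ID is trained only once; later samples are ignored.
        if phase_id not in self._width:
            self._width[phase_id] = choose_width(ipc_narrow, ipc_wide)

    def width_for(self, predicted_phase_id: int, default: int = 8) -> int:
        # Untrained phases run at the aggressive width until sampled.
        return self._width.get(predicted_phase_id, default)
```

Defaulting untrained phases to the 8-wide configuration is a design choice of this sketch, made so that an unknown phase does not lose performance before it has been sampled.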

Figure 13 is a graph of the results seen when applying phase- 
directed width re-configuration. The white circles in the graph 
show the behavior of running the programs on only a 2-wide 
machine relative to the more aggressive 8-wide machine. The 
dotted line again shows what could potentially be achieved if 
voltage scaling was used. While mcf and art save a lot of en- 
ergy with little performance degradation on a 2-wide machine, 
the other programs do not fare as well. The program apsi, for 
example, has a slowdown of over 22% with an energy savings of 
around 30%. This does not compare favorably to voltage scal- 
ing (as discussed in Section 6.2). On the other hand if we use 
phase-directed width throttling on apsi, a total processor en- 
ergy savings of 18% can be achieved with only 2.2% slowdown. 

For all of the programs we examined, with one exception, 
the slowdown due to phase aware width throttling was less than 
4%, while the average energy savings was 19.6%. This result 
demonstrates that there is significant benefit to be had in the re- 
configuration of processor front end resources even at very large 
granularities. In the worst case, this will mean a re-configuration 
every 10 million instructions, and on average every 70 million 
instructions. This should be designable even under conservative 



7 Summary 

In this paper we present an efficient run-time phase tracking ar- 
chitecture that is based on detecting changes in the code being 
executed. This is accomplished by dividing up all instructions 
seen into a set of buckets based on branch PCs. This way we ap- 
proximate the effect of taking a random projection of the basic 






block vector, which was shown in [21] to be an effective method 
of identifying phases in programs. 
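
A minimal software sketch of the bucket accumulation idea is given below. The bucket count, the hash that maps a branch PC to a bucket, and the names `BucketAccumulator`, `record_branch`, and `signature` are assumptions for illustration; the summary above only states that instructions are divided into buckets based on branch PCs.

```python
from typing import List


class BucketAccumulator:
    """Accumulates executed-instruction counts into buckets keyed by branch PC."""

    def __init__(self, num_buckets: int = 32) -> None:
        self.counts: List[int] = [0] * num_buckets

    def record_branch(self, branch_pc: int, instructions: int) -> None:
        """Credit the instructions executed since the previous branch
        to the bucket selected by this branch's PC."""
        # Hypothetical hash: drop the two low bits, then take the remainder.
        bucket = (branch_pc >> 2) % len(self.counts)
        self.counts[bucket] += instructions

    def signature(self) -> List[float]:
        """Return the fraction of the interval's instructions in each bucket."""
        total = sum(self.counts)
        if total == 0:
            return [0.0] * len(self.counts)
        return [c / total for c in self.counts]

    def reset(self) -> None:
        """Clear the counters at the start of a new interval."""
        self.counts = [0] * len(self.counts)
```

Two intervals that execute similar code in similar proportions would produce similar signatures under this scheme, which is what allows them to be grouped into the same phase.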

Using our phase classification architecture with less than 500 
bytes of on-chip memory, we show that for most programs, a sig- 
nificant amount of the program (over 80%) is covered by 20 or 
fewer distinct phases. Furthermore, we show that these phases, 
while being distinct from one another, have fairly uniform be- 
havior within a phase, meaning that most optimizations applied 
to one phase will work well on all intervals in that phase. In the 
program gcc, the IPC attained by the processor on average over 
the full run of execution is 1.32, but has a standard deviation 
of more than 43%. By dividing it up into different phases, we 
achieve much more stable behavior, with IPCs ranging between 
0.61 and 1.95, but now with standard deviations of less than 2%. 

In addition to this, we present a novel phase prediction archi- 
tecture using a Run Length Encoding Markov predictor that can 
predict not only when a phase change is about to occur, but also 
the phase ID to which it will transition. In using this design, which 
also uses less than 500 bytes of storage, we achieve a phase 
prediction miss rate of 10% for applu and 4% for apsi. In 
comparison, always predicting that the phase will stay the same 
results in a miss rate of 40% and 12% respectively. 
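
To make the idea of a run-length-encoded Markov predictor concrete, here is a simplified software sketch. It assumes the table is indexed by the pair (current phase ID, number of consecutive intervals spent in that phase) and falls back to predicting the current phase when no entry exists. The unbounded dictionary, the removal of entries that wrongly predicted a change, and the class name `RLEMarkovPredictor` are choices of this sketch, not details taken from the text.

```python
from typing import Dict, Optional, Tuple


class RLEMarkovPredictor:
    """Predicts the next phase ID from (current phase ID, run length)."""

    def __init__(self) -> None:
        self.table: Dict[Tuple[int, int], int] = {}
        self.phase: Optional[int] = None  # phase ID of the last interval
        self.run = 0                      # consecutive intervals in that phase

    def predict(self) -> Optional[int]:
        # Without a matching entry, predict that the phase stays the same.
        return self.table.get((self.phase, self.run), self.phase)

    def update(self, actual: int) -> None:
        """Record the phase ID actually observed for the latest interval."""
        if self.phase is None:
            self.phase, self.run = actual, 1
        elif actual == self.phase:
            # Drop an entry that predicted a change which did not happen.
            self.table.pop((self.phase, self.run), None)
            self.run += 1
        else:
            # Remember which phase followed this phase at this run length.
            self.table[(self.phase, self.run)] = actual
            self.phase, self.run = actual, 1
```

On a strictly repeating phase sequence, this sketch mispredicts each transition the first time it is seen and then predicts both the timing and the target of later transitions correctly.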

We also examined using our phase tracking and prediction 
architecture to enable new phase-directed optimizations. Tra- 
ditional architecture and software optimizations are targeted at 
the average or aggregate behavior of a program. In comparison, 
phase-directed optimizations aim at optimizing a program's per- 
formance tailored to the different phases in a program. In this pa- 
per, we examined using phase tracking and prediction to increase 
frequent value profiling coverage, and to provide energy savings 
through data cache and processor width re-configuration. 

We believe our phase tracking and prediction design will 
open the door for a new class of run-time optimization that tar- 
gets large scale program behavior. Even though we present a 
hardware implementation for phase tracking, a similar design 
can be implemented in software to perform phase classification 
for run-time optimizers, just-in-time compilation systems, and 
operating systems. Hardware and software optimizations that 
can potentially benefit the most from phase classification and 
prediction are (1) those that need expensive profiling/training 
before applying an optimization, (2) those that are slow or 
expensive to perform, and (3) those that can benefit from specialization where 
they have the same code/data being used differently in different 
phases of execution. By using our dynamic phase tracking and 
prediction design, phase-behavior can be characterized and pre- 
dicted at the largest of scales, providing a unified mechanism for 
phase-directed optimization. 

Acknowledgments 

We would like to thank Jeremy Lau and the anonymous review- 
ers for providing useful comments on this paper. This work 
was funded in part by NSF CAREER grant No. CCR-9733278, 
Semiconductor Research Corporation grant No. SRC-2001-HJ- 
897, and an equipment grant from Intel. 



References 

[1] J.L. Aragon, J. Gonzalez, and A. Gonzalez. Power-aware control speculation 
through selective throttling. In Proceedings of the Ninth International 
Symposium on High-Performance Computer Architecture, February 2003. 

[2] R. Balasubramonian, D. H. Albonesi, A. Buyuktosunoglu, and S. Dwarkadas. 
Memory hierarchy reconfiguration for energy and performance in 
general-purpose processor architectures. In 33rd International Symposium 
on Microarchitecture, pages 245-257, 2000. 

[3] R. D. Barnes, E. M. Nystrom, M. C. Merten, and W. W. Hwu. Vacuum 
packing: Extracting hardware-detected program phases for post-link 
optimization. In 35th International Symposium on Microarchitecture, 
December 2002. 

[4] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: a framework for 
architectural-level power analysis and optimizations. In 27th Annual 
International Symposium on Computer Architecture, pages 83-94, June 2000. 

[5] D. C. Burger and T. M. Austin. The SimpleScalar tool set, version 2.0. 
Technical Report CS-TR-97-1342, U. of Wisconsin, Madison, June 1997. 

[6] B. Calder, P. Feller, and A. Eustace. Value profiling and optimization. 
Journal of Instruction Level Parallelism, March 1999. 

[7] B. Calder, G. Reinman, and D.M. Tullsen. Selective value prediction. In 
26th Annual International Symposium on Computer Architecture, pages 64-74, 
June 1999. 

[8] I-C. Chen, J. T. Coffey, and T. N. Mudge. Analysis of branch prediction 
via data compression. In Seventh International Conference on Architectural 
Support for Programming Languages and Operating Systems, pages 128-137, 
October 1996. 

[9] A. Dhodapkar and J. E. Smith. Dynamic microarchitecture adaptation via 
co-designed virtual machines. In International Solid State Circuits 
Conference, February 2002. 

[10] A. Dhodapkar and J. E. Smith. Managing multi-configuration hardware via 
dynamic working set analysis. In 29th Annual International Symposium on 
Computer Architecture, May 2002. 

[11] M. Huang, J. Renau, and J. Torrellas. Profile-based energy reduction in 
high-performance processors. In 4th Workshop on Feedback-Directed and 
Dynamic Optimization (FDDO-4), December 2001. 

[12] A. Iyer and D. Marculescu. Power aware microarchitecture resource scaling. 
In Proceedings of the DATE 2001 on Design, Automation and Test in Europe, 
pages 190-196, 2001. 

[13] D. Joseph and D. Grunwald. Prefetching using Markov predictors. In 24th 
Annual International Symposium on Computer Architecture, June 1997. 

[14] M.H. Lipasti, C.B. Wilkerson, and J.P. Shen. Value locality and load value 
prediction. In Seventh International Conference on Architectural Support 
for Programming Languages and Operating Systems, pages 138-147, 
October 1996. 

[15] M. Merten, A. Trick, R. Barnes, E. Nystrom, C. George, J. Gyllenhaal, and 
Wen-mei W. Hwu. An architectural framework for run-time optimization. 
IEEE Transactions on Computers, 50(6):567-589, June 2001. 

[16] M. Mock, C. Chambers, and S.J. Eggers. Calpa: a tool for automating 
selective dynamic compilation. In 33rd International Symposium on 
Microarchitecture, pages 291-302, December 2000. 

[17] R. Muth, S.A. Watterson, and S.K. Debray. Code specialization based on 
value profiles. In Static Analysis Symposium, pages 340-359, 2000. 

[18] P. Ranganathan, S. V. Adve, and N.P. Jouppi. Reconfigurable caches and 
their application to media processing. In 27th Annual International 
Symposium on Computer Architecture, pages 214-224, June 2000. 

[19] T. Sherwood and B. Calder. Time varying behavior of programs. Technical 
Report UCSD-CS99-630, UC San Diego, August 1999. 

[20] T. Sherwood, E. Perelman, and B. Calder. Basic block distribution analysis 
to find periodic behavior and simulation points in applications. In 
International Conference on Parallel Architectures and Compilation 
Techniques, September 2001. 

[21] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder. Automatically 
characterizing large scale program behavior. In Proceedings of the 10th 
International Conference on Architectural Support for Programming Languages 
and Operating Systems, October 2002. 

[22] J. Yang and R. Gupta. Frequent value locality and its applications. 
Special Issue on Memory Systems, ACM Transactions on Embedded Computing 
Systems, 1(1):79-105, November 2002. 






Intel/P16136 
US Application Serial No. 10/424,356 



EXHIBIT 5 

TO DECLARATION OF 
JAMES A FLIGHT 



Exhibit A to response to OA of 10-10-06.txt 
From: Tim Sherwood [sherwood@cs.ucsb.edu] 
Sent: Monday, October 30, 2006 4:28 PM 
To: James A. Flight 
Subject: Re: Phase Tracking and Prediction publication 

The publication appeared in ISCA 2003 and so was officially published on June 9th, 
2003. 

http://cs.nyu.edu/isca03/ 



On Mon, 30 Oct 2006, James A. Flight wrote: 

> I am trying to properly cite to your publication "Timothy Sherwood, 
> Suleyman Sair, and Brad Calder. 
> <http://www-cse.ucsd.edu/~calder/abstracts/ISCA-03-Phase.html> Phase 
> Tracking and Prediction, In the proceedings of the 30th Annual Intl. 
> Symposium on Computer Architecture (ISCA 2003), June 2003. San Diego, 
> California." 
> 
> I ran across this link 
> "http://www-cse.ucsd.edu/Dienst/UI/2.0/Describe/ncstrl.ucsd_cse/CS2002-0710" 
> indicates that you actually published this one year earlier (i.e., on 
> June 23, 2002). Can you please confirm which is the correct date? 
> 
> Thank you very much, 
> 
> James A. Flight 
> 
> 20 North Wacker Drive, Suite 4220 
> Chicago, Illinois 60606 
> (312) 580-1034 (Direct) 
> (312) 580-1020 (Main) 
> (312) 580-9696 (Fax) 
> 
> jflight@hfzlaw.com 
> 
> IMPORTANT: This electronic mail message and any attached files contain 
> information intended for the exclusive use of the individual or entity 
> to whom it is addressed and may contain information that is 
> proprietary, privileged, confidential and/or exempt from disclosure 
> under applicable law. If you are not the intended recipient, please 
> notify the sender, by electronic mail or telephone, of any unintended 
> recipients and delete the original message without making any copi 



Intel/P16136 
US Application Serial No. 10/424,356 



EXHIBIT 6 
TO DECLARATION OF 
JAMES A FLIGHT 



ISCA 2003 




ISCA 2003 

The thirtieth International Symposium on Computer 
Architecture (ISCA) will be held at the Town and Country 
Hotel in San Diego 9-11 June, 2003. ISCA 2003 is a 
constituent conference in the ACM Federated Computing 
Research Conference (FCRC). 

Main Page 
Final Program 
Travel and Registration 
Student Travel Grants 
Companion Travel Grants 
Workshops and Tutorials 
Call for Papers 
Organizing Committee 
Program Committee 

Allan Gottlieb 



http://cs.nyu.edu/isca03/ 



Intel/P16136 
US Application Serial No. 10/424,356 



EXHIBIT 7 
TO DECLARATION OF 
JAMES A FLIGHT 



Phase tracking and prediction 


Phase tracking and prediction 

Full text: Pdf (674 KB) 

Source: ACM SIGARCH Computer Architecture News archive 
Volume 31, Issue 2 (May 2003) 
SESSION: Prediction 
Pages: 336 - 349 
Year of Publication: 2003 
ISSN: 0163-5964 

Also published in ... 

Authors: 
Timothy Sherwood, University of California, San Diego 
Suleyman Sair, University of California, San Diego 
Brad Calder, University of California, San Diego 

Publisher: ACM Press, New York, NY, USA 

DOI Bookmark: http://doi.acm.org/10.1145/871656.859657 



ABSTRACT 

In a single second a modern processor can execute billions of instructions. Obtaining a bird's eye 
view of the behavior of a program at these speeds can be a difficult task when all that is available is 
cycle by cycle examination. In many programs, behavior is anything but steady state, and 
understanding the patterns of behavior, at run-time, can unlock a multitude of optimization 
opportunities. In this paper, we present a unified profiling architecture that can efficiently capture, 
classify, and predict phase-based program behavior on the largest of time scales. By examining the 
proportion of instructions that were executed from different sections of code, we can find generic 
phases that correspond to changes in behavior across many metrics. By classifying phases 
generically, we avoid the need to identify phases for each optimization, and enable a unified 
prediction scheme that can forecast future behavior. Our analysis shows that our design can capture 
phases that account for over 80% of execution using less than 500 bytes of on-chip memory. 






http://portal.acm.org/citation.cfm?id=871656.859657&coll=Portal&dl=ACM&CFID=286... 7/16/2007 









This Article has also been published in: 

• International Symposium on Computer Architecture 

Proceedings of the 30th annual international symposium on Computer architecture 

The ACM Portal is published by the Association for Computing Machinery. Copyright © 2007 ACM, Inc. 









Intel/P16136 
US Application Serial No. 10/424,356 



EXHIBIT 8 
TO DECLARATION OF 
JAMES A FLIGHT 



Phase Tracking and Prediction 



Timothy Sherwood 

Suleyman Sair 

Brad Calder 



Department of Computer Science and Engineering 
University of California, San Diego 
{sherwood,ssair,calder}@cs.ucsd.edu 



Abstract 

In a single second a modern processor can execute billions 
of instructions. Obtaining a bird's eye view of the behavior of a 
program at these speeds can be a difficult task when all that is 
available is cycle by cycle examination. In many programs, be- 
havior is anything but steady state, and understanding the pat- 
terns of behavior, at run-time, can unlock a multitude of opti- 
mization opportunities. 

In this paper, we present a unified profiling architecture that 
can efficiently capture, classify, and predict phase-based pro- 
gram behavior on the largest of time scales. By examining the 
proportion of instructions that were executed from different sec- 
tions of code, we can find generic phases that correspond to 
changes in behavior across many metrics. By classifying phases 
generically, we avoid the need to identify phases for each opti- 
mization, and enable a unified prediction scheme that can fore- 
cast future behavior. Our analysis shows that our design can 
capture phases that account for over 80% of execution using less 
than 500 bytes of on-chip memory. 

1 Introduction 

Modern processors can execute upwards of 5 billion instructions 
in a single second, yet most architectural features target program 
behavior on a time scale of hundreds to thousands of instructions, 
less than half a μs. While these optimizations can provide 
large benefits, they are limited in their ability to see the program 
behavior in a larger context. 

Recently there has been a renewed interest in examin- 
ing the run-time behavior of programs over longer periods of 
time [10, 11, 19, 20, 3]. It has been shown that programs can 
have considerably different behavior depending on which por- 
tion of execution is examined. More specifically, it has been 
shown that many programs execute as a series of phases, where 
each phase may be very different from the others, while still hav- 
ing a fairly homogeneous behavior within a phase. Taking ad- 
vantage of this time varying behavior can lead to, among other 
things, improved power management, cache control, and more 
efficient simulation. The primary goal of this research is the de- 
velopment of a unified run-time phase detection and prediction 
mechanism that can be used to guide any optimization seeking 
to exploit large scale program behavior. 

A phase of program behavior can be defined in several ways. 
Past definitions are built around the idea of a phase being an in- 
terval of execution during which a measured program metric is 
relatively stable. We extend this notion of a phase to include all 
similar sections of execution regardless of temporal adjacency. 
Simply put, if a phase of execution is correctly identified, there 



should only be small variations between any two execution in- 
tervals identified as being part of the same phase. A key point of 
this paper is that the phase behavior seen in any program metric 
is directly a function of the way the code is being executed. If 
we can accurately capture this behavior at run-time through the 
computation of a single metric, we can use this to guide many 
optimization and policy decisions without duplicating phase de- 
tection mechanisms for each optimization. 

In this paper, we present an efficient run-time phase tracking 
architecture that is based on detecting changes in the propor- 
tions of the code being executed. In addition, we present a novel 
phase prediction architecture that can predict, not only when a 
phase change is about to occur, but also the phase to which it 
will transition. Since our phase tracking implementation is 
based upon code execution frequencies, it is independent of any 
individual architecture metric. This allows our phase tracker to 
be used as a general profiling technique building up a profile or 
database of architecture information on a per phase basis to be 
used later for hardware or software optimization. Independence 
from architecture metrics allows us to consistently track phase 
information as the program's behavior changes due to phase- 
based optimizations. 

We demonstrate the effectiveness of our hardware based 
phase detection and classification architecture at automatically 
partitioning the behavior of the program into homogeneous 
phases of execution and at identifying phase changes. We show 
that the changes in many important metrics, such as IPC and en- 
ergy, correlate very closely with the phase changes found by our 
metric. We then evaluate the effectiveness of phase tracking and 
prediction for value profiling, data cache reconfiguration, and 
re-configuring the width of the processor. 

The rest of the paper is laid out as follows. In Section 2, 
prior work related to phase-based program behavior is discussed. 
Simulation methodology and benchmark descriptions can be 
found in Section 3. Section 4 describes our phase tracking ar- 
chitecture. The design and evaluation of the phase predictor are 
found in Section 5. Section 6 presents several potential applica- 
tions of our phase tracking architecture. Finally, the results are 
summarized in Section 7. 

2 Related Work 

In this Section we describe work related to phase identification 
and phase-based optimization. 

In [19], we provided an initial study into the time varying 
behavior of programs, showing that programs have repeatable 
phase-based behavior over many hardware metrics — cache be- 
havior, branch prediction, value prediction, address prediction, 



IPC and RUU occupancy for all the SPEC 95 programs. Looking 
at these metrics over time, we found that many programs have 
repeating patterns, and that important metrics tend to change at 
the same time. These places represent phase boundaries. 

In [20], we proposed that by profiling only the code that was 
executed over time we could automatically identify periodic and 
phase behavior in programs. The goal was to automatically find 
the repeating patterns observed in [19], and the lengths (peri- 
ods) of these patterns. We then extended this work in [21], using 
techniques from machine learning to break the complete exe- 
cution of the program into phases (clusters) by only tracking the 
code executed. We found that intervals of execution grouped into 
the same phase had similar behavior across all the architecture 
metrics examined. From this analysis, we created a tool called 
SimPoint [21], which automatically identifies a small set of in- 
tervals of execution (simulation points) in a program to perform 
architecture simulations. These simulation points provide an ac- 
curate and efficient representation of the complete execution of 
the program. 

The work of Dhodapkar and Smith [10, 9] is the most closely 
related to ours. They found a relationship between phases and 
instruction working sets, and that phase changes occur when the 
working set changes. They propose that by detecting phases and 
phase changes, multi-configuration units can be re-configured in 
response to these phase changes. They have used their working 
set analysis for instruction cache, data cache and branch predic- 
tor re-configuration to save energy [10, 9]. 

The work we present in this paper identifies phases and phase 
changes by keeping track of the proportions in which the code 
was executed during an interval based upon the profiler used 
in [20]. In comparison, Dhodapkar and Smith [10, 9] track the 
phase and phase changes solely upon what code was executed 
(working set), without weighting the code by its frequency of 
execution. Future research is needed to compare these two ap- 
proaches. 

Additional differences between our work and theirs include our 
examination of architectures for predicting phase changes, and 
different uses from [10, 9], such as value profiling and processor width 
reconfiguration. We provide an architecture that can fairly accu- 
rately predict what the next phase will be, along with predicting 
when there will be a phase change. In comparison, Dhodapkar 
and Smith do not examine phase-based prediction [10, 9], but 
concentrate on detecting when the working set size changes, and 
then reactively apply optimization. 

Merten et al. [15] developed a run-time system for dynami- 
cally optimizing frequently executed code. Then in [3], Barnes 
et al. extend this idea to perform phase-directed compiler optimizations. 
The main idea is the creation of optimized code 
"packages" that are targeted towards a given phase, with the goal 
of execution staying within the package for that phase. Barnes et 
al. concentrate primarily on the compiler techniques needed to 
make phase-directed compiler optimizations a reality, and do not 
examine the mechanics of hardware phase detection and classi- 
fication. We believe that using the techniques in [3] in conjunc- 
tion with our phase classification and prediction architecture will 
provide a powerful run-time execution environment. 



I Cache         [size illegible] 4-way set-associative, [illegible] byte blocks, [illegible] cycle latency 
D Cache         16k 4-way set-associative, 32 byte blocks, 1 cycle latency 
L2 Cache        128K 8-way set-associative, 64 byte blocks, 12 cycle latency 
Main Memory     120 cycle latency 
Branch Pred     hybrid - 8-bit gshare w/ 2k 2-bit predictors + a 8k bimodal predictor 
O-O-O Issue     out-of-order issue of up to 4 operations per cycle, 64 entry re-order buffer 
Mem Disambig    load/store queue, loads may execute when all prior store addresses are known 
Registers       32 integer, 32 floating point 
Func Units      2-integer ALU, 2-load/store units, 1-FP adder, 1-integer MULT/DIV, 1-FP MULT/DIV 
Virtual Mem     8K byte pages, 30 cycle fixed TLB miss latency after earlier-issued instructions complete 

Table 1: Baseline Simulation Model. 



3 Methodology 

To perform our study, we collected information for ten 
SPEC 2000 programs: applu, apsi, art, bzip, facerec, 
galgel, gcc, gzip, mcf, and vpr, all with reference inputs. 
All programs were executed from start to completion using Sim- 
pleScalar [5] and Wattch [4]. Because of the lengthy simulation 
time incurred by executing all of the programs to completion, 
we chose to focus on only 10 programs. We chose the above 
10 programs since their phase based behavior represents a rea- 
sonable snapshot of the SPEC 2000 benchmark suite, along with 
picking some of the programs that showed the most interesting 
phase-based behavior. Each program was compiled on a DEC 
Alpha AXP-21164 processor using the DEC C and FORTRAN 
compilers. The programs were built under the OSF/1 V4.0 operating 
system using full compiler optimization (-O4 -ifo). 

The timing simulator used was derived from the Sim- 
pleScalar 3.0 tool set [5], a suite of functional and timing simu- 
lation tools for the Alpha AXP ISA. The baseline microarchitec- 
ture model is detailed in Table 1. In addition to this, we wanted 
to examine energy usage optimizations, so we used a version of 
Wattch [4] to capture this information. We modified all of these 
tools to log and reset the statistics every 10 million instructions, 
and we use this as a base for evaluation. 

4 Phase Capture 

In this section we motivate the occurrence of phase-based behav- 
ior, describe our architecture for capturing it, and examine the 
accuracy of using the program behavior in our phase-tracking 
architecture to identify phase changes for various hardware met- 
rics. 

4.1 Phase-Based Behavior 

The goal of this research is to design an efficient and general pur- 
pose technique for capturing and predicting the run-time phase 
behavior of programs for the purpose of guiding any optimiza- 
tion seeking to exploit large scale program behavior. Figure 1 
helps to motivate our approach to the problem. This figure shows 
the behavior of two programs, gcc and gzip, as measured by 
various different statistics over the course of their execution from 
start to finish. Each point on the graph is taken over 10 mil- 
lion instructions worth of execution. The metrics shown are the 




[Figure 1 plots not reproduced.] 

Figure 1: To illustrate the point that phase changes happen across many metrics all at the same time, we have plotted the value 
of these metrics over billions of instructions executed for the programs gcc (shown left) and gzip (shown right). Each point on 
the graph is an average over 10 million instructions. The number of unified L2 cache misses (ul2), the energy consumed by the 
execution of the instructions, the number of instruction cache (il1) misses, the number of data cache misses (dl1), the number of 
branch mispredictions (bpred) and the average IPC are plotted. 



number of unified L2 cache misses (ul2), the energy consumed 
by the execution of the instructions, the number of instruction 
cache (il1) misses, the number of data cache misses (dl1), the 
number of branch mispredictions (bpred) and the average IPC. 
The results show that all of the metrics tend to change in unison, 
although not necessarily in the same direction. In addition to 
this, patterns of recurring behavior can be seen over very large 
time scales. 

As can be seen from these graphs, even at a granularity of 10 
million instructions (which is at the same time scale as operating 
system time slices) there can be wildly different behavior seen 
between intervals. In this paper, we concentrate on a granularity 
of 10 million instructions because it is both outside the scope 
of normal architectural timing and is small enough to allow for 
many complex phase behaviors to be seen. 

4.2 Tracking Phases by Executed Code 

Our phase tracker architecture operates at two different time 
scales. It gathers profile information very quickly in order to 
keep up with processor speeds, while at the same time it com- 
pares any data it gathers with information collected over the long 
term. Additionally, it must be able to do all that while still being 
reasonable in size. 

Our phase profile generation architecture can be seen in Fig- 
ure 2. The key idea is to capture basic block information during 
execution, while not relying on any compiler support. Larger 
basic blocks need to be weighted more heavily as they account 
for a more significant portion of the execution. To approximate 
gathering basic block information, we capture branch PCs and 
the number of instructions executed between branches. The in- 
put to the architecture is a tuple of information: a branch iden- 
tifier (PC) and the number of instructions since the last branch 



PC was executed. This allows us to roughly capture each basic 
block executed along with the weight of the basic block in terms 
of the number of instructions executed, as we did in [20, 21] for 
identifying simulation points. 

Classifying phases by examining only the code that is ex- 
ecuted allows our phase tracker to be independent of any in- 
dividual architecture metric. This allows our phase tracker to 
be used as a general profiling technique building up a profile or 
database of architecture information on a per phase basis to be 
used later for hardware or software optimization. Independence 
from architecture metrics is also very important to allow us to 
consistently track phase information as the program's behavior 
changes due to phase-based optimizations. 

At this point it is worth making more explicit the differences 
between our technique and that of Dhodapkar and Smith [10, 9]. 
Dhodapkar and Smith use a bit vector to track the working set of 
the code for a particular interval, while our technique is based 
on the basic block vectors used in [20]. The bit vectors of Dhodapkar 
and Smith track a metric that is related to which code 
blocks were touched, whereas our metric tracks the proportion 
of time spent executing in each code block. This is a subtle but 
important distinction. We have found that in complex programs 
(such as gcc and gzip) there are many instruction blocks that 
execute only intermittently. When tracking the pure working set, 
these infrequently executed blocks can disguise the frequently 
executed blocks that dominate the behavior of the application. 
On the other hand, by tracking the frequency of code execution 
it is possible to distinguish important instructions (basic blocks) 
from a sea of infrequently executed ones. Examining these differences 
in more detail is a topic of future research. 

Another advantage of tracking the proportions in which the 
basic blocks are executed is that we can use this information to 



[Figure 2 diagram not reproduced; its two tables are labeled Accumulator and Past Footprints.] 




Figure 2: Our phase classification architecture. Each branch PC 
is captured along with the number of instructions from the last 
branch. The bucket entry corresponding to a hash of the branch 
PC is incremented by the number of instructions. After each 
profiling interval has completed, this information is classified, 
and if it is found to be unique enough, stored in the past footprint 
table along with its phase ID. 

identify not only when different sections of code are executing, 
but also when those sections of code are being exercised differently. 
A simple example is in a graphics manipulation program 
running a parameterized filter on an input image. If you run a 
simple 3x3 blur filter on an image you get very different behavior 
than if you run a 7x7 filter on the same image despite the fact that 
the same filter code is executing. The 7x7 filter will have many 
more memory references and those memory references conflict 
very differently in the cache than in the 3x3 case. We have seen 
this very behavior in examining the interactive graphics program 
xv. Using the proportion of execution for each basic block can 
distinguish these differences, because in the 3x3 filter the head 
of the loop is called more than twice as frequently as in the 7x7 
filter. 

The same general idea applies to other data structures as 
well. Take for example a linked list. As the number of nodes in 
the linked list traversal changes over different loop invocations, 
the number of instructions executed inside the loop versus the 
time spent outside the loop also changes. This behavior can be 
captured when including a measure of the proportion of the code 
executed, and this can distinguish between linked list traversals of 
different lengths. 

4.3 Capturing the Code Profile 

To index into the accumulator table in Figure 2, the branch PC 
is reduced to a number from 1 to Nbuckets using a hash func- 
tion. We have found that 32 buckets is sufficient to distinguish 
between different phases even for some of the more complex 
programs such as gcc. A counter is kept for each bucket, and 
the counter is incremented by the number of instructions from 
the last branch to the current branch being processed. Each ac- 
cumulator table entry is a large (in this study 24-bit), saturating 
counter, which will not saturate during our profiling interval of 
10 million instructions. Updating the accumulator table is the 
only operation that needs to be performed at a rate equivalent to 



the processor's execution of the program (once for every branch 
executed). In comparison, the phase classification described be- 
low needs to only be performed once every 10 million instruc- 
tions (at the end of each interval), and thus is not nearly as per- 
formance critical. 
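
The per-branch update described above can be sketched in a few lines. This is only an illustration under stated assumptions: the hash function (`bucket_index`), the sample branch PCs, and the instruction counts are hypothetical, and buckets are indexed from 0 rather than from 1. Only the bucket count (32) and the 24-bit saturating counters come from the text.

```python
from typing import List

N_BUCKETS = 32                 # bucket count reported in the text
COUNTER_MAX = (1 << 24) - 1    # 24-bit saturating counters, as in the text


def bucket_index(branch_pc: int, n_buckets: int = N_BUCKETS) -> int:
    """Hypothetical hash: drop the low 2 bits of the PC, then take the remainder."""
    return (branch_pc >> 2) % n_buckets


def update_accumulator(table: List[int], branch_pc: int, insns_since_last_branch: int) -> None:
    """Add the instruction count to the bucket selected by the branch PC, saturating."""
    i = bucket_index(branch_pc, len(table))
    table[i] = min(table[i] + insns_since_last_branch, COUNTER_MAX)


accumulator = [0] * N_BUCKETS
# (branch PC, instructions since the previous branch) tuples; the values are made up.
for pc, count in [(0x1200, 7), (0x1204, 3), (0x1200, 7), (0x2000, 12)]:
    update_accumulator(accumulator, pc, count)
```

Because each update is a hash, an add, and a clamp, it is the only step that has to keep pace with branch execution; everything else can run once per interval.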

We note that the hashing function we use is fundamentally 
the same as the random projection method we used to generate 
phases in [21]. In this prior work, we make use of random pro- 
jections of the data to reduce the dimensionality of the samples 
being taken. A random projection takes trace data in the form of 
a matrix of size L x B, where L is the length of the trace and B is 
the number of unique basic blocks, and multiplies it by a random 
matrix of size B x N, where N is the desired dimensionality of 
the data, which is much smaller than B. This creates a new matrix 
of size L x N, which has clustering properties very similar 
to the original data. The random projection method is a powerful 
technique when used with clustering algorithms, and for captur- 
ing phase behavior as we showed in [21]. The hashing scheme 
we use in this paper is essentially a degenerate form of random 
projection that makes a hardware implementation feasible while 
still having low error. If all the elements of the random projec- 
tion matrix consist of either a 0 or a 1, and they are placed such 
that no column of the matrix contains more than a single 1, then 
the random projection is identical to this simple hashing mech- 
anism. We have designed our phase classification architecture 
around this principle. 
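
The equivalence between hashing and a 0/1 projection can be checked on a toy example. The sketch below is illustrative only: the sizes, the modulo hash (`hash_bucket`), and the per-block counts are made up, and the matrix orientation (one row per basic block, one column per bucket, with a single 1 in each basic block's row) is an assumption of this sketch rather than something the text fixes.

```python
def hash_bucket(block_id: int, n_buckets: int) -> int:
    """Hypothetical hash from a basic block index to a bucket index."""
    return block_id % n_buckets


def project(row, matrix):
    """Multiply a 1 x B row vector by a B x N matrix."""
    n = len(matrix[0])
    return [sum(row[b] * matrix[b][j] for b in range(len(row))) for j in range(n)]


B, N = 6, 3  # toy sizes: 6 basic blocks, 3 buckets

# 0/1 matrix in which basic block b contributes only to bucket hash_bucket(b, N).
matrix = [[1 if hash_bucket(b, N) == j else 0 for j in range(N)] for b in range(B)]

# One interval of the trace: instructions executed in each basic block (made up).
interval = [5, 0, 2, 1, 4, 0]

# Direct hashing: accumulate each block's count into its bucket.
hashed = [0] * N
for b, count in enumerate(interval):
    hashed[hash_bucket(b, N)] += count

# Projection: the same interval multiplied by the 0/1 matrix.
projected = project(interval, matrix)
```

Both paths give the same bucket totals, which is why the hardware can replace the matrix multiply with a hash and an add.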

Figure 3 shows the effect of applying the above mentioned 
technique for capturing the phase behavior of the integer bench- 
mark gzip. The x-axis of the figure is in billions of instructions, 
as is the case in Figure 1. Each point on the y-axis represents an 
entry of the phase tracker's accumulator table. Each point on the 
graph corresponds to the value of the corresponding accumulator 
table entry at the end of a profiling interval. Dark values repre- 
sent high execution frequency, while light values correspond to 
low frequency. The same trends that were seen in Figure 1 for 
gzip can be clearly seen in Figure 3. In both of these figures, 
when observing them at the coarsest granularity, we can see that 
there are at least three different phases labeled A, B and C. In 
Figure 3, the phase tracker table entries 2, 5, 7, 13 and 
17 distinguish the two identical long running phases labeled A 
from a group of three long running phases labeled C. Phase table 
entries 12 and 20 clearly distinguish phase B from both A and 
C. This figure is pictorial evidence that the phase tracker is able 
to break the program's execution into the corresponding phases 
based solely on the executed code, and that these phases corre- 
spond to the behavior seen across the different program metrics 
in Figure 1. 

4.4 Forming a Footprint 

After the profiling interval has elapsed, and branch block infor- 
mation has been accumulated, the phase must then be classified. 
To do this we keep a history of past phase information. 

If we fix the number of instructions for a profiling interval, 
then we can divide each bucket by this fixed number to get the 
percentage of execution that was accounted for by all instructions 
mapped to that bucket. However, we do not need to know 
the exact percentages for each bucket. Instead of keeping the 



[Figure 3 plot not reproduced.] 



Figure 3: Visualization of the accumulator table used to track 
program behavior for gzip. The x-axis is in billions of instructions, 
while the y-axis is the entry of the accumulator table. Each 
point on the graph corresponds to the value of the accumulator 
table at the end of a profiling interval, where dark values correspond 
to more heavily accessed entries. The same trends that 
were seen in Figure 1 can be clearly seen in Figure 3. 

full counter values, we can instead compress phase information 
down to a couple of the most significant bits. This compressed 
information will then be kept in the Past Footprint table as shown 
in Figure 2. 

The number of counter value bits that we need to observe is 
related to Nbuckets. As we increase the number of buckets, the 
data is spread over more buckets (table entries), making for fewer 
entries per bucket (better resolution) but at the cost of more area 
(both in terms of number of buckets and more bits per bucket). 
To be on the safe side, we would like any distribution of data into 
buckets to provide useful information. To achieve this we need 
to ensure that even if data is distributed perfectly evenly over 
all of the buckets, we would still record information about the 
frequency of those buckets. This can be achieved by reducing 
the accumulator counter by: 

(bucket[i] x Nbuckets) / (intervalsize) 

If the number of buckets and interval size are powers of two, 
this is a simple shift operation. For the number of buckets we 
have chosen (32), and the interval size we profile over, this re- 
duces the bucket size to 6 bits, and thus requires 24 bytes of stor- 
age for each unique phase in the Past Footprint table. In practice 
we see that the top 6 bits of the counter are more than enough 
to distinguish between two phases. In the worst case, you may 
need one or two more bits to reduce quantization error, but in 
reality we have not seen any programs that cause this to be an 
issue. 
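
The arithmetic behind the 6-bit, 24-byte figures can be worked through as follows. This is a sketch under one loud assumption: the interval size is taken to be 2^23 instructions, a power-of-two stand-in for the roughly 10 million instructions used in the study, so that the reduction is a pure shift. The bucket count and counter width are the ones given in the text.

```python
N_BUCKETS = 32            # from the text
INTERVAL_SIZE = 1 << 23   # assumption: power-of-two stand-in for ~10 million
COUNTER_BITS = 24         # from the text

# Dividing by INTERVAL_SIZE and multiplying by N_BUCKETS is a right shift by 23 - 5 = 18.
SHIFT = (INTERVAL_SIZE.bit_length() - 1) - (N_BUCKETS.bit_length() - 1)


def compress(counter: int) -> int:
    """(counter x N_BUCKETS) / INTERVAL_SIZE, computed as a right shift."""
    return counter >> SHIFT


def footprint(accumulator):
    """Reduce every accumulator entry to its most significant bits."""
    return [compress(c) for c in accumulator]


# A perfectly even distribution still records a nonzero value in every bucket.
even = [INTERVAL_SIZE // N_BUCKETS] * N_BUCKETS

bits_per_entry = COUNTER_BITS - SHIFT                  # 24 - 18 = 6 bits
bytes_per_footprint = N_BUCKETS * bits_per_entry // 8  # 32 x 6 / 8 = 24 bytes
```

With these numbers an evenly loaded bucket compresses to 1 and a saturated 24-bit counter compresses to 63, the largest 6-bit value.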

If too few buckets are used, aliasing effects can occur due 
to the hashing function, where two different phases will appear 
to have very similar Footprints. Therefore, we want to use a 
sufficiently large number of buckets to uniquely identify the dif- 
ferences in code execution between phases, while at the same 
time use only a small amount of area. 

To examine the aliasing effect and determine what the appro- 
priate number of buckets should be, Figure 4 shows the sum of 
the differences in the bucket weights found between all sequen- 
tial intervals of execution. The y-axis shows the sum total of 
differences for each program. This is calculated by summing the 




Number of Counters 

Figure 4: The percent difference found between Footprints from 
sequential intervals of execution, when varying the number of 
counters used to represent the footprints. The results are nor- 
malized to the difference between intervals found when having 
an infinite number of buckets to represent the footprint; 32 buck- 
ets captures most of the benefit. 

differences between the buckets captured for interval i and i — 1 
for each interval i in the program. The x-axis is the number of 
distinct buckets used. All of the results are compared to the ideal 
case of using an infinite number of buckets (or one for each sep- 
arate basic block) to create the Footprint. On the program gcc 
for example, the total sum of differences with 32 buckets was 
72% of that captured with an infinite number of buckets. In gen- 
eral we have found that 32 buckets was enough to distinguish 
between two phases. 
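
The metric plotted in Figure 4 can be sketched as below. The sketch assumes that "differences" means absolute differences between corresponding buckets, uses a hypothetical modulo hash to fold basic blocks into buckets, and runs on made-up per-block weights; none of these details are stated in the text.

```python
def sum_sequential_differences(vectors):
    """Sum, over every interval i, the bucket-wise absolute difference from interval i - 1."""
    total = 0
    for prev, cur in zip(vectors, vectors[1:]):
        total += sum(abs(a - b) for a, b in zip(prev, cur))
    return total


def fold(vector, n_buckets):
    """Fold per-basic-block weights into n_buckets using a hypothetical modulo hash."""
    out = [0] * n_buckets
    for block, weight in enumerate(vector):
        out[block % n_buckets] += weight
    return out


# Per-basic-block weights for three intervals (made up). One bucket per block is
# the "infinite buckets" ideal the results are normalized against.
ideal = [[8, 0, 0, 2], [0, 8, 2, 0], [8, 0, 0, 2]]
hashed = [fold(v, 2) for v in ideal]

ideal_total = sum_sequential_differences(ideal)
hashed_total = sum_sequential_differences(hashed)
ratio = hashed_total / ideal_total   # fraction of the ideal difference retained
```

In this toy case two buckets retain 60% of the difference seen with one bucket per block; aliasing merges blocks whose changes partly cancel.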

4.5 Classifying a Footprint to a Phase ID 

After reducing the vector to form a footprint, we begin the classification 
process by comparing the footprint to a set of representative 
past footprint vectors. We compare the current vector 
to each vector in the table. The next section details how we per- 
form the comparison and determine what a match is. If there is 
a match, we classify the profiled section of execution into the 
same phase as the past footprint vector, and the current vector 
is not inserted into the past footprint table. If there is no match, 
then we have just detected a new phase and hence must create a 
new unique phase ID into which we may classify it. This is done 
by choosing a unique phase ID out of a fixed pool of IDs. When 
allocating a new phase ID, we also allocate a new past footprint 
entry, set it to the current vector, and store with that entry the 
newly allocated phase ID. This allows future similar phases to 
be classified with the same ID. In this way only a single vector 
is kept for each unique phase ID, to serve as a representative of 
that phase. After a phase ID is provided for the most recent in- 
terval, it is passed along to prediction and statistic logging, and 
the phase identification part of our algorithm is completed. 
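
The table logic described above can be sketched as a small class. Several details are assumptions made for illustration: the class and method names are hypothetical, the first matching entry wins, the match test is passed in as a function, and `None` is returned once the fixed pool of IDs is exhausted (this section of the text does not say what happens in that case).

```python
class PhaseClassifier:
    """Keeps one representative footprint per phase ID."""

    def __init__(self, matches, max_phases=20):
        self.matches = matches        # function(footprint, past_footprint) -> bool
        self.max_phases = max_phases  # size of the fixed pool of phase IDs
        self.table = []               # past footprint table: (footprint, phase_id)

    def classify(self, fp):
        # A match reuses the stored phase ID; the current vector is not inserted.
        for past, phase_id in self.table:
            if self.matches(fp, past):
                return phase_id
        # Assumption: give up once the ID pool is exhausted.
        if len(self.table) >= self.max_phases:
            return None
        # New phase: allocate an ID and store the current vector as its representative.
        phase_id = len(self.table)
        self.table.append((list(fp), phase_id))
        return phase_id


def close(a, b):
    """Toy match test: total absolute difference of at most 2."""
    return sum(abs(x - y) for x, y in zip(a, b)) <= 2


classifier = PhaseClassifier(close, max_phases=2)
ids = [classifier.classify(fp) for fp in ([1, 1], [1, 2], [9, 9], [1, 1])]
```

Here the second and fourth intervals fall into the phase created by the first, while `[9, 9]` is far enough away to receive a new ID.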

To examine the number of phase IDs we need to track, Fig- 
ure 5 shows the percentage of execution that can be accounted 
for by the top p phases, where p is shown on the x-axis. Results 
are graphed for the programs that had the min (galgel) 
and max (art) coverage, gcc, gzip, and the overall average. 
These results show that most of the program's phase behavior 
can be captured using a relatively small number of phase IDs. 



Figure 5: Results of the minimum number of phases that need 
to be captured versus the amount of program execution they cover. 
The y-axis is the percent of program execution that is covered. 
The x-axis is the minimum number of phases needed to capture 
that much program execution. 

If we only track and optimize for the top 20 phases in each ap- 
plication, we will capture and be able to accurately apply phase 
prediction/optimizations to over 90% of the program's execution 
on average. In the worst case (min), we are able to optimize most 
of the program (over 80%) by only targeting a small number (20) 
of important recurring phases. 
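
Coverage by the top p phases is a simple count once every interval has a phase ID. The sketch below assumes equal-length intervals, so that the share of intervals equals the share of execution; the function name and the sample ID sequence are made up.

```python
from collections import Counter


def top_phase_coverage(phase_ids, p):
    """Fraction of intervals that belong to the p most frequent phase IDs."""
    counts = Counter(phase_ids)
    covered = sum(n for _, n in counts.most_common(p))
    return covered / len(phase_ids)


# One phase ID per profiling interval (made-up sequence of 10 intervals).
interval_ids = [0, 0, 0, 0, 1, 1, 1, 2, 2, 3]

coverage_top1 = top_phase_coverage(interval_ids, 1)
coverage_top2 = top_phase_coverage(interval_ids, 2)
```

In this toy sequence the most frequent phase covers 40% of the intervals and the top two cover 70%.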

4.5.1 Finding a Match 

We search through the Footprint histories to find a match, but 
this query is complicated by the fact that we are not necessar- 
ily searching for an exact match. Two sections of execution that 
have very similar footprints could easily be considered a match, 
even if they do not compare exactly. To compare two vectors 
to one another, we use the Manhattan distance between the two, 
which is the element-wise sum of the absolute differences. This 
distance is used to determine if the current interval should be 
classified as the same phase ID as one of the past footprint intervals. 

If we set the distance threshold too low, the phase detection 
will be overly sensitive, and we will classify the program into 
many, very tiny phases which will cause us to lose any bene- 
fit from doing run-time phase analysis in the first place. If the 
threshold is too high, the classifier will not be able to distinguish 
between phases with different behavior. To quantify this effect, 
we examine how well our hardware technique classifies phases 
for a variety of thresholds compared to the phases found by the 
off-line clustering algorithm used in SimPoint [21]. 
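
The comparison itself is short enough to write out. In this sketch the function names and sample vectors are hypothetical, and treating a distance equal to the threshold as a match is an assumption; the text only says the Manhattan distance is compared against a threshold.

```python
def manhattan(a, b):
    """Element-wise sum of absolute differences between two footprints."""
    if len(a) != len(b):
        raise ValueError("footprints must have the same number of buckets")
    return sum(abs(x - y) for x, y in zip(a, b))


def is_match(a, b, threshold):
    """Assumption: a distance at or below the threshold counts as the same phase."""
    return manhattan(a, b) <= threshold


current = [3, 0, 5, 1]
past = [2, 1, 5, 4]
distance = manhattan(current, past)   # 1 + 1 + 0 + 3
```

A lower threshold makes `is_match` reject more pairs, which produces more and smaller phases; a higher one merges intervals whose behavior may differ.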

The SimPoint tool is able to make global decisions to opti- 
mize the grouping of similar intervals into phases. The off-line 
algorithm makes no use of thresholds; instead, its decisions are 
based solely on the structure found in the distribution of pro- 
gram behaviors. Our technique must be far more simplistic be- 
cause it must be performed on-line and with limited computa- 
tional overhead. This reduction in complexity comes at the cost 
of increased error. 

The Different Phases line in Figure 6 shows the ability of 
our hardware technique to find phase changes (transitions be- 
tween one phase and the next) when different thresholds are used 



Figure 6: Results showing how well our hardware phase tracker 
classifies two sequential intervals of execution as being from 
"Different" or the "Same" phase of execution. The percent of 
misclassifications are shown in comparison to the phase classi-
fications found using the off-line clustering SimPoint tool [21].

to perform the phase classification. For example, when using a 
Manhattan distance of 1 million as our threshold (shown as 20 
on our x-axis because it is in log2), our hardware technique iden-
tified 80% of the phase changes that occurred in the more com-
plex off-line SimPoint analysis. Conversely, 20% of the phase
changes were incorrectly classified as having the same phase ID 
as the last interval of execution. 

Likewise, the Same Phases line in Figure 6 represents the 
ability of our hardware technique to accurately classify two se- 
quential intervals as being part of the same phase as a function 
of different thresholds (again as compared to the off-line cluster- 
ing analysis). For example, when using a Manhattan distance of 
1 million (shown as 20 on the x-axis), our hardware technique 
identified 80% of the intervals that stayed in the same phase 
as correctly staying in the same phase, but 20% of those inter- 
vals were classified as having a different phase ID from the prior 
phase. 

A misclassification occurs when two sequential intervals of 
execution are classified as being in the same phase or in different 
phases using our hardware approach when the off-line clustering 
analysis tool found the opposite for these two intervals. 
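
The two misclassification rates can be computed from a pair of labelings as sketched here. The sketch assumes each labeling is a list with one phase ID per interval; the function name is hypothetical, and the phase ID values themselves need not agree between the two labelings because only transitions are compared.

```python
def transition_misclassification(hw_ids, ref_ids):
    """Compare consecutive-interval transitions of two phase labelings.

    Returns (different_miss_rate, same_miss_rate): the fraction of reference
    phase changes that the hardware labeling missed, and the fraction of
    reference non-changes that the hardware labeling reported as changes.
    """
    if len(hw_ids) != len(ref_ids):
        raise ValueError("labelings must cover the same intervals")
    diff_total = diff_missed = same_total = same_missed = 0
    for i in range(1, len(ref_ids)):
        ref_change = ref_ids[i] != ref_ids[i - 1]
        hw_change = hw_ids[i] != hw_ids[i - 1]
        if ref_change:
            diff_total += 1
            diff_missed += not hw_change
        else:
            same_total += 1
            same_missed += hw_change
    diff_rate = diff_missed / diff_total if diff_total else 0.0
    same_rate = same_missed / same_total if same_total else 0.0
    return diff_rate, same_rate
```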

If we are too aggressive and our hardware phase analysis in- 
dicates that there are phase changes when there are actually no 
noticeable changes in behavior, then we will create too many 
phase IDs that have similar behavior. This can create more over- 
head for performing phase-based optimization. On the other 
hand, if we are too passive in distinguishing between different
phases, we will be missing opportunities to make phase specific 
optimizations. 

In order to strike a balance between having a high capture 
rate and reducing the percent of false positives, we chose to use 
a threshold of 1 million. When comparing this with the interval 
size of 10 million instructions, this means that a difference in the 
phase behavior will be detected if 10% of the executed instruc- 
tions are in different proportions. In choosing 1 million, we have 
on average a 20% misclassification rate. Note that a misclassi-
fication does not necessarily mean that an incorrect optimization



will be performed. For example, if we have a "Same Phase" mis- 
classification (the two intervals were really from the same phase, 
but were classified into different phases), then a phase change is 
observed using our hardware technique when there was not one 
in the baseline classifier. If the two hardware detected phases 
have the same optimization applied to them, then this misclassi-
fication can have no effect.
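
The arithmetic behind the chosen threshold is small enough to check directly. The constant names here are illustrative; the values are the ones quoted in the text.

```python
import math

INTERVAL_INSTRUCTIONS = 10_000_000  # interval size quoted in the text
THRESHOLD = 1_000_000               # Manhattan distance threshold

# The threshold as a fraction of the interval size.
fraction = THRESHOLD / INTERVAL_INSTRUCTIONS

# Position of the threshold on a log2 axis (just under 20).
log2_threshold = math.log2(THRESHOLD)

print(fraction, log2_threshold)
```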

4.6 Per-Phase Performance Metric Homogeneity 

Using the techniques presented above, we can perform phase 
classification on programs at run-time with little to no impact 
on the design of the processor core. One of the goals of phase 
classification is to divide the program into a set of phases that are 
fairly homogeneous. This means that an optimization adapted 
and applied to a single segment of execution from one phase, 
will apply equally well to the other parts of the phase. In order to 
quantify the extent to which we have achieved this goal, we need 
to test the homogeneity of a variety of architectural statistics on 
a per-phase basis. 

Figure 7 shows the results of performing this analysis on the 
phases determined at run-time. Due to space constraints we only 
show results for two of the more complicated programs gcc and 
gzip. For both programs, a set of statistics for each phase is 
shown. The first phase listed (separated from the rest), labeled
full, is the result of classifying the entire program into a single
phase. The results show that for gcc for example, the average 
IPC of the entire program was 1.32, while the average number 
of cache misses was 445,083 per ten million instructions. In 
addition to just the average value, we also show the standard 
deviation for that statistic. For example, while the average IPC 
was 1.32 for gcc, it varied with a standard deviation of over 
43% from interval to interval. If the phase-tracking hardware is 
successful in classifying the phases, the standard deviations for 
the various metrics should be low for a given phase ID. 
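
A homogeneity check of this kind can be sketched as follows. The sketch assumes one metric sample per interval and reports the standard deviation relative to the mean, as the percentages in the text do; whether the original analysis used the population or the sample standard deviation is not stated, so the use of pstdev is an assumption.

```python
from collections import defaultdict
from statistics import mean, pstdev

def per_phase_stats(phase_ids, values):
    """Map each phase ID (and "full") to (mean, stddev relative to mean)."""
    if len(phase_ids) != len(values):
        raise ValueError("need one metric value per interval")
    groups = defaultdict(list)
    for phase_id, value in zip(phase_ids, values):
        groups[phase_id].append(value)
    groups["full"] = list(values)  # whole program treated as a single phase
    stats = {}
    for name, vals in groups.items():
        avg = mean(vals)
        relative = pstdev(vals) / avg if avg else float("nan")
        stats[name] = (avg, relative)
    return stats

# Hypothetical IPC samples: two phases that are each perfectly stable.
stats = per_phase_stats([0, 0, 1, 1], [1.0, 1.0, 2.0, 2.0])
print(stats)
```

In this toy input each phase has zero deviation, while the program as a whole varies by about a third of its mean.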

Underneath the phase marked full are the five most fre- 
quently executed phases from the program as identified by our 
phase tracker. The phases are weighted by the percentage of the 
program's executed instructions they account for. For gcc, the 
largest phase accounts for 18.5% of the instructions in the entire 
program and has an average IPC of 0.61 and a standard devi- 
ation of only 1.6% (of 0.61). The other top four phases have 
standard deviations at or below this level, which means that our 
technique was successful at dividing up the execution of gcc 
into large phases with similar execution behavior with respect to 
IPC. Note that some metrics for certain phases have a high stan-
dard deviation, but this occurs for architecture features/metrics
that are unimportant for that phase. For example, the phase that 
occurs for 7.2% of execution in gcc has only 75 L1 instruction
cache misses on average. This is an L1 miss rate of 0.00075%,
so an error of 215% for this metric will not likely have any effect 
on the phase. 

When we look at the energy consumption of gcc, it can be 
observed that energy consumption swings radically (a standard 
deviation of 90%) over the complete execution of the program. 
This can be seen visually in Figure 1, which plots the energy
usage versus instructions executed. However, after dividing the 
program into phases, we see that each phase has very little variation
within itself. All have less than 2% standard deviation.
By analyzing gcc it can also be seen that the phase partitioning 
does a very good job across all of the measured statistics even 
though only one metric is used. This indicates that the phases 
that we have chosen are in some way representative of the actual 
behavior of the program. 

5 Phase Prediction 

The prior section described our phase tracking architecture, and 
how it can be used to classify phases. In this section we focus on 
using phase information to predict the next phase. For a variety 
of applications it is important to be able to predict future phase 
changes so that the system can configure for the code it will soon 
be executing rather than simply reacting to a change in behavior. 

Figure 8 shows the percentage of interval transitions that are
changes in phase, for our set of benchmarks. For all of these pro- 
grams, phase changes come quite often, but it should be noted 
that this statistic alone cannot gauge the complexity of the pro- 
gram behavior. The program gcc switches less than 10% of 
the time but switches between many different phases. The other 
extreme is art which switches almost half the time, but it is 
only switching between a few distinct phases. In this case, large 
repeating patterns can be observed. No two phases executing se- 
quentially are that similar, but there is an order to the sequence. 
By adding in a prediction scheme for these cases, we not only 
take advantage of stable conditions as in past research, but actu- 
ally take advantage of any repeating patterns in program behav- 
ior. 

5.1 Markov Predictor 

The prediction of phase behavior is different from many other 
systems in which hardware predictors are used. Because of this 
new environment, a new type of predictor has the potential to 
perform better than simply using predictors from other areas of 
computer architecture (branch and address prediction, memory 
disambiguation, etc.). 

After observing the way that phases change, we determined 
that two pieces of information are important. First, the set of 
phases leading up to the prediction are very important, and sec- 
ond, the duration of execution of those phases is important. 

A classic prediction model that is easily implementable in 
hardware is a Markov Model. Markov Models have been used 
in computer architecture to predict both prefetch addresses [13] 
and branches [8] in the past. The basic idea behind a Markov 
Model is that the next state of the system is related to the last set 
of states. 

The intuition behind this design is that phase information 
tends to be characterized by many sections of stable behavior 
interspersed with abrupt phase changes. The key is to be able to 
predict when these phase changes will occur, and to know ahead 
of time what phase they will change to. The problem is that the 
changes are often preceded by stable conditions, and if we only 
consider the last couple of intervals we will not be able to tell 
the difference between sections of stable behavior that precede 
a phase change, and those sections that will continue to be sta- 
ble. Instead, we need a way of compressing down stable phase 





prog  phase   IPC (stddev)    bpred (stddev)    dl1 (stddev)      il1 (stddev)       energy (stddev)       ul2 (stddev)

gcc   full    1.32            [illegible]       [illegible]       [illegible]        [illegible]           [illegible]
gcc   18.5%   0.61 (1.6%)     34665 (22.0%)     753382 (5.4%)     125091 (23.2%)     1.03E+09 (1.8%)       395997 (5.3%)
gcc   18.1%   1.95 (0.3%)     13048 (3.9%)      28112 (15.1%)     43 (73.9%)         3.22E+08 (0.2%)       1006 (5.6%)
gcc   7.2%    0.64 (0.2%)     843 (15.1%)       885081 (0.1%)     75 (215.5%)        9.78E+08 (0.3%)       443655 (0.1%)
gcc   4.0%    1.49 (1.2%)     10145 (7.6%)      703554 (6.8%)     15591 (5.2%)       4.20E+08 (1.1%)       354084 (7.0%)
gcc   3.9%    1.76            2015 (13.6%)      98947 (5.9%)      102 (45.?%)        [illegible]           [illegible] (12.6%)

gzip  full    1.33 (16.3%)    56045 (11.1%)     90446 (58.2%)     60 (138.1%)        4.82E+08 (13.5%)      22880 (112.0%)
gzip  17.1%   1.24 (3.4%)     533?0 (10.8%)     969?0 (10.1%)     12 (44.2%)         5.05E+08 (3.5%)       24218 (8.6%)
gzip  9.4%    1.23 (3.8%)     54973 (11.5%)     99523 (11.3%)     11 (45.5%)         5.09E+08 (3.8%)       24518 (9.3%)
gzip  8.8%    1.76 (0.6%)     56449 (4.8%)      37331 (5.6%)      241 (8.4%)         3.55E+08 (0.6%)       5617 (15.6%)
gzip  8.0%    1.22 (4.3%)     54791 (6.8%)      99671 (11.9%)     4? (25.7%)         5.14E+08 (4.4%)       28153 (11.0%)
gzip  7.4%    1.24 (3.1%)     55215 (11.1%)     96701 (9.6%)      12 (35.4%)         5.04E+08 (3.2%)       23701 (8.4%)

[A "?" marks a digit that is illegible in the scan. In the gcc full row
only the IPC value and two standard deviations, (135.5%) and (110.7%),
are legible; their columns cannot be determined.]

Figure 7: Examination of per-phase homogeneity compared to the program as a whole (denoted by full). For the two programs
and each of the top 5 phases of each program, we show the average value of each metric and the standard deviation. The name
of the phase is the percent of execution that it accounts for in terms of instructions. These results show that after dividing up the
program into phases using our run-time scheme the behavior within each phase is quite consistent.






Figure 8: The percent of execution intervals that transition to 
a different phase from the prior execution interval's phase as 
found by our phase tracking architecture with 32 footprint coun-
ters using a 1 million Manhattan threshold.

information into a piece of information that we can use as state. 

5.2 Run Length Encoding Markov Predictor 

To compress the stable state we use a Run Length Encoding 
(RLE) Markov predictor. The basic idea behind the predictor 
is that it uses a run-length encoded version of the history to in- 
dex into a prediction table. The index into the prediction table is 
a hash of the phase identifier and the number of times the phase 
identifier has occurred in a row. 

Figure 9 shows our RLE Markov Phase ID prediction archi-
tecture. The lower order bits of the hash function provide an
index into the prediction table, and the higher order bits of the 
hash function provide a tag. When there is a tag match, the phase 
ID stored in the table provides a prediction as to the next phase 
to occur in execution. When there is a tag miss, the prior phase 
ID is assumed to be the next phase ID to occur in the program's
execution. We found predicting the last phase ID to be 75%
accurate on average.

We only update the predictor when there is (1) a change in 
the phase ID, or (2) when there is a tag match. We only insert an 
entry when there is a phase ID change, since we want to predict 



























Figure 9: Phase Prediction Architecture for the Run Length En- 
coded (RLE) Markov predictor. The basic idea behind the pre- 
dictor is that two pieces of information are used to generate the 
prediction, the phase id that was just seen, and the number of 
times prior to now that it has been seen in a row. The index into 
the prediction table is a hash of these two numbers. 

when the phase is going to change. Execution intervals where 
the same phase ID occurs several times in a row do not need 
to be stored in the table, since they will be correctly predicted 
as "last phase ID" when there is a table miss. This helps
table capacity constraints and avoids polluting the table with last 
phase predictions. For the second update case, when there is 
a tag match, we update the predictor because the observed run 
length may have potentially changed. 

5.3 Predictor Comparison 

We compare our RLE Markov phase predictor with other pre- 
diction schemes in Figure 10. This Figure has four bars for ev- 
ery program, and each bar corresponds to the prediction accu- 
racy of a prediction architecture. The first and simplest scheme, 
Last Phase, simply predicts that the next phase is the same as 
the current phase, in essence always predicting stable operation. 
The prediction accuracy of this scheme is inversely proportional 
to the rate at which phases change in a given benchmark. For 
the program gzip for example, there are long periods of execu- 
tion where the phase does not change, and therefore predicting 
no-change does exceptionally well. 

In order to ensure that we were not simply providing an







Figure 10: Phase ID Prediction Accuracy. This figure shows
how well different prediction schemes work. The most naive 
scheme, last, simply predicts that the phases never change. 
The bars marked Markov and RLE Markov show how well 
we can predict the phase identifiers if we use a Markov predic- 
tion scheme with a Markov table size of 256 entries. 

expensive filter for noise in the phase IDs, we also compared 
against a simple noise filter which works by predicting that the 
next phase will be the most commonly occurring of the last three 
phases seen. This is not shown, as it actually performed worse 
on all of the programs. 
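
The two simplest schemes discussed above, Last Phase and the three-interval noise filter, can be sketched as plain functions over a history of phase IDs. The tie-breaking rule in the filter (prefer the most recent phase) and the accuracy helper are assumptions added for illustration.

```python
from collections import Counter

def predict_last_phase(history):
    """Predict that the next phase equals the most recent one."""
    return history[-1]

def predict_majority_of_three(history):
    """Predict the most common of the last three phases seen."""
    recent = history[-3:]
    counts = Counter(recent)
    best = max(counts.values())
    # Break ties in favor of the most recent phase (an assumption).
    for phase_id in reversed(recent):
        if counts[phase_id] == best:
            return phase_id

def accuracy(predictor, trace):
    """Fraction of intervals (after the first) that predictor gets right."""
    if len(trace) < 2:
        return 0.0
    hits = sum(predictor(trace[:i]) == trace[i] for i in range(1, len(trace)))
    return hits / (len(trace) - 1)
```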

Additionally we wanted to examine the effect of a simple 
Markov model predictor for history lengths of 1 and 2. The 
Markov model predictor does a better job of predicting phase 
transitions than Last Phase, but it is limited by the fact that 
long runs will always be predicted as infinitely stable due to the 
history filling up. However, it is still very effective for facerec
and applu, but does not provide much benefit for either art or 
galgel. 
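
A simple Markov predictor of this kind can be sketched as a table keyed by the last k phase IDs. The unbounded dictionary stands in for the fixed-size hardware table and is an assumption, as is the fallback to the last phase when a history has not been seen before.

```python
class MarkovPredictor:
    """Sketch of a Markov phase predictor keyed by the last k phase IDs."""

    def __init__(self, history_length=2):
        if history_length < 1:
            raise ValueError("history_length must be at least 1")
        self.k = history_length
        self.table = {}    # tuple of recent phase IDs -> next phase ID
        self.history = []  # at most k most recent phase IDs

    def predict(self):
        if not self.history:
            return None
        key = tuple(self.history)
        return self.table.get(key, self.history[-1])

    def update(self, actual):
        if self.history:
            self.table[tuple(self.history)] = actual
        self.history.append(actual)
        del self.history[:-self.k]  # keep only the k most recent IDs
```

Once a stable run is longer than k intervals, the key consists of k copies of the same ID whether or not a change is imminent, which illustrates the limitation noted in the text.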

The final bar, RLE Markov, is our improved Markov pre- 
dictor which compresses stable phases into a tuple of phase 
id and duration. All of the Markov predictors simulated had 
256 entries taking up less than 500 bytes of storage. Using 
RLE Markov outperforms both the Last Phase and tradi- 
tional Markov on all the benchmarks. It performs especially 
well compared to other schemes on both applu and art. Over- 
all, using a Run-Length Encoded Markov predictor can cut the 
phase mispredictions down to 14% on average. 

6 Applications 

This section examines three optimization areas in which a phase- 
aware architecture can provide an advantage. We begin by ex- 
amining the relationship between phase behavior and value lo- 
cality. We then demonstrate ways to reduce processor energy 
consumption by adjusting the aggressiveness of the data cache 
and the instruction front end. 

6.1 Frequent Value Locality 

Prior work on value predictors has shown that there is a great 
deal of value locality in a variety of programs [14, 7]. Recently, 
researchers have started to take advantage of frequently loaded 



values for the purpose of optimizing caches. For example, Yang 
and Gupta [22] proposed a data cache organization that com- 
presses the most frequently used program values in order to save 
energy. Another way of exploiting value locality is through value 
specialization, which can be done either statically or dynami- 
cally [6, 17, 16] to create specialized versions of procedures or 
code-regions based upon the values frequently seen. These tech- 
niques are built on the idea of finding the most frequent values 
for loads over the whole program, and then specializing the pro- 
gram to those frequent values. 

We examine the potential of capturing frequent values on a 
per-phase basis and compare this to the frequent values aggre- 
gated over the entire program, as would be used in value code 
specialization [6]. To perform this experiment we first gathered 
the top 16 values that were loaded over the complete execution 
of the program and stored them into a table. We then examined 
the percentage of executed loads that found their loaded value in 
this table. This result is shown as Static in Figure 11. While 
significant portions of some programs are covered by just these 
few top values (such as applu), over half of the programs have 
less than 10% of their loaded values covered by these top values. 

The question is: can we do better by exploiting hardware- 
detected phase information? To answer this question we take the 
top 16 values for each phase, as detected by the hardware phase 
tracker. These top values will be shared across a single phase 
even if it is split into two or more different sections of execution. 
Each load in the program is then checked against the top val- 
ues for its corresponding phase. The Phase Coverage bar 
in Figure 11 shows the percent of all load values in the program
that were successfully matched to their per-phase top value set.

Without any notion of loads or values, our method of divid- 
ing up phases is very successful at assisting in the search for fre- 
quent values. By just tracking the top 16 values of each phase, 
we are able to capture the values from almost 50% of the exe-
cuted loads on average. The Perfect bar shows the percentage of
loads covered if one captures the top 16 load values for each and 
every interval (i.e., 10 million instructions) separately. This is in 
effect the best that we could hope to achieve for an interval size 
of 10 million instructions, because the 16 entries in the value ta- 
ble are custom crafted for each interval individually. As shown 
in Figure 11, the phase-tracker compares favorably with the op-
timal coverage. Two thirds of the total possible benefit from
per-interval value locality can be captured by per-phase value 
locality. It is important to point out this graph by itself is not 
a good indicator of usefulness as near perfect coverage could 
be achieved simply by making every interval a separate phase. 
However, as shown in Figure 5, only a few phases (around 20)
are used to cover at least 80% of the program's execution. 
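
The three coverage measures differ only in how intervals are grouped before the top values are chosen, which the following sketch makes explicit. The load-trace layout, the function name, and the toy data are hypothetical; a table size of 1 is used in the example only to keep it small.

```python
from collections import Counter, defaultdict

def value_coverage(loads, group_of, top_n=16):
    """Fraction of loads whose value is among the top_n values of its group.

    loads: list of (interval_index, loaded_value) pairs.
    group_of: function mapping an interval index to a group key.
    """
    if not loads:
        return 0.0
    per_group = defaultdict(Counter)
    for interval, value in loads:
        per_group[group_of(interval)][value] += 1
    covered = sum(count
                  for counter in per_group.values()
                  for _, count in counter.most_common(top_n))
    return covered / len(loads)

# Toy trace: three intervals, with intervals 0 and 2 in the same phase.
loads = [(0, "a"), (0, "a"), (1, "b"), (1, "b"), (2, "a"), (2, "c")]
phase_ids = [0, 1, 0]

static = value_coverage(loads, lambda i: 0, top_n=1)              # whole program
phase = value_coverage(loads, lambda i: phase_ids[i], top_n=1)    # per phase
perfect = value_coverage(loads, lambda i: i, top_n=1)             # per interval
print(static, phase, perfect)
```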

6.2 Dynamic Data Cache Size Adaptation

In a modern processor a significant amount of energy is con- 
sumed by the data cache, but this energy may not be put to 
good use if an application is not accessing large amounts of data 
with high locality. To address this potential inefficiency, previ- 
ous work has examined the potential of dynamically reconfigur- 
ing the data caches with the intention of saving power. In [2], 
Balasubramonian et al. present two different schemes with




[Figure 11 legend: Optimal Coverage, Phase Coverage, Static Coverage.]

[Figure 12 legend: circles = Small Cache, triangles = Phase Aware.
Program key: 0 applu, 1 apsi, 2 art, 3 bzip2, 4 facerec, 5 galgel,
6 gcc, 7 gzip, 8 mcf, 9 vpr.]


Figure 11: The percent of the program's load values that are 
found in a table of the most frequently values loaded over the 
whole program (Static Coverage), on a per-phase basis (Phase 
Coverage), and on a per execution interval basis (Optimal Cov- 
erage). 

which re-configuration may be guided. In one scheme, hard- 
ware performance counters are read by re-configuration software 
every hundred thousand cycles. The software then makes a de- 
cision based on the values of the counters. In another scheme, 
re-configuration decisions are performed on procedure bound- 
aries instead of at fixed intervals. To reduce the overhead of re- 
configuration, software to trigger re-configuration is only placed 
before procedures that account for more than a certain percent- 
age of execution. 

Another form of re-configurable cache that has been pro- 
posed dynamically divides the data cache into multiple parti- 
tions, each of which can be used for a different function such 
as instruction reuse buffers, value predictors, etc. [18]. These
techniques can be triggered at different points in program exe- 
cution including procedure boundaries and fixed intervals. The 
overhead of re-configuration can be quite large and making these 
policy decisions only when the large scale program behavior 
changes, as indicated by phase shifts in our hardware tracker, 
can minimize overhead while guaranteeing adequate sensitivity 
to attain maximum benefit. 

We examined the use of phase tracking hardware to guide an 
energy aware, re-sizable cache. The energy consumption of the 
data cache can be reduced by dynamically shifting to a smaller, 
less associative cache configuration for program phases that do 
not benefit significantly from more aggressive cache configura- 
tions. By targeting only those phases that are predicted to have 
energy savings due to cache size reduction, our scheme is able 
to reduce power with very little impact on the performance. 

We examined an architecture with two possible cache con- 
figurations, 32KB 4-way associative and 8KB direct mapped. In 
Figure 12, the trade off between these two configurations is plot- 
ted. For each program, we use the 32KB cache configuration as 
the baseline result. The labeled circles in Figure 12 show the 
total processor energy savings and performance degradation for 
each program if only the smaller (8KB) cache size is used. For 
example, a processor with a smaller cache configuration for the 
program applu is both 5% slower and uses 5% less energy. 



[Figure 12 x-axis: Slowdown.]

Figure 12: Data Cache Re-configuration. The tradeoff between 
energy savings and slowdown for two different cache policies. 
All results are relative to a 32KB 4-way associative cache. The 
circles in the graph (each labeled with a number for the program 
the data point is from) show the energy and performance of an 
8KB direct mapped cache. The triangles show the tradeoff of in- 
telligently switching between an 8KB direct mapped and a 32KB 
4-way data cache based on phase classification and prediction. 

Two programs, vpr and apsi, actually use more energy with a 
smaller cache due to large slow downs. These two points are off 
the scale of this graph and are not shown. 

While examining energy savings and slow down is interest- 
ing, it is important to note that there is more than one way to 
reduce both energy and performance. Voltage scaling in particu- 
lar has proven to be a technology capable of reaping large energy 
savings for a relative reduction in performance. For our results, 
we assume that for voltage scaling a performance degradation of 
5% will yield an approximate energy saving of 15%. We use this 
rule of thumb as our guideline for determining when to reduce 
the active size of the cache. In Figure 12, this simple model of 
voltage scaling is plotted as a dashed line. When the cache size 
is reduced, most programs fall far short of this baseline, meaning 
that voltage scaling would provide a better performance-energy 
tradeoff. There are a couple of exceptions, in particular mcf, 
bzip, and gzip do well even without any sort of phase-based 
re-configuration. 

The shaded triangles in Figure 12 show what happens if 
we use phase classification and prediction to guide our re- 
configuration. When a new phase ID is seen, we sample the IPC 
and energy used for a few intervals using the 32KB 4-way cache, 
and a few intervals for the 8KB direct mapped cache. These sam- 
ples could be kept in a small hardware profiling table associated 
with the phase ID. After taking these samples, if we find that a 
particular phase is able to achieve more than three times the en- 
ergy savings relative to the slow down seen when using the 8KB 
cache, we then predict for this phase ID that the smaller cache 
size should be used. This heuristic means that the small cache 
size is used only if re-configuration would beat voltage scaling 
for that phase. After a decision has been made as to the con- 



[Figure 13 legend: circles = Low Issue, triangles = Phase Aware.
Program key: 0 applu, 1 apsi, 2 art, 3 bzip2, 4 facerec, 5 galgel,
6 gcc, 7 gzip, 8 mcf, 9 vpr. x-axis: Slowdown.]



Figure 13: Processor Width Adaptation. The tradeoff between 
energy savings and slowdown for two different front end poli- 
cies. All results are relative to an aggressive 8-issue machine. 
The circles in the graph (each labeled with a number for the 
program) show the energy and performance of a less aggressive 
2-issue processor. The triangles show using the phase classifier
and predictor for switching between 2-issue and 8-issue based 
on phase changes. 

figuration to use for a phase ID, the corresponding cache size is 
stored in the phase profiling table/database associated with that 
phase ID. The phase classifier and predictor are then used to pre- 
dict when a phase change occurs. When a phase change predic- 
tion occurs, the predicted phase ID looks up the cache size in the 
profiling table, and re-configures the cache (if it is not already 
that size) at the predicted phase change. 
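
The per-phase decision rule can be sketched as follows. Several details are assumptions: slowdown is derived from IPC as the relative increase in execution time for a fixed instruction count, samples are simple (IPC, energy) pairs, and the profiling table is an ordinary dictionary. The factor of three comes from the voltage scaling rule of thumb described above.

```python
from statistics import mean

LARGE = "32KB 4-way"
SMALL = "8KB direct-mapped"

def choose_cache_config(samples_large, samples_small, ratio=3.0):
    """Pick a cache configuration for one phase from sampled intervals.

    Each samples_* argument is a list of (ipc, energy) pairs measured
    while running with that configuration.
    """
    ipc_large = mean(ipc for ipc, _ in samples_large)
    ipc_small = mean(ipc for ipc, _ in samples_small)
    energy_large = mean(energy for _, energy in samples_large)
    energy_small = mean(energy for _, energy in samples_small)
    slowdown = ipc_large / ipc_small - 1.0      # relative time increase
    savings = 1.0 - energy_small / energy_large
    use_small = savings > 0 and savings > ratio * slowdown
    return SMALL if use_small else LARGE

# Hypothetical profiling table: phase ID -> chosen configuration.
profile = {}
profile[7] = choose_cache_config([(2.0, 100.0)] * 3, [(1.96, 90.0)] * 3)
profile[9] = choose_cache_config([(2.0, 100.0)] * 3, [(1.80, 90.0)] * 3)
print(profile)
```

At a predicted phase change, the predicted phase ID would be looked up in the table and the cache re-configured only if the stored configuration differs from the current one.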

For all programs, our re-configuration is able to beat 
or tie voltage scaling. For example, using phase-based re- 
configuration results in a slowdown of 0.5% for applu, while 
the total energy savings is 4.5%. Even the program apsi, which 
had increased energy consumption in the small cache configura- 
tion, is able to get almost 5% energy savings with only a 1% 
slowdown. 

6.3 Dynamic Processor Width Adaptation 

One way to reduce the energy consumption in a processor is to 
reduce the number of instructions entering the pipeline every cy- 
cle [12, 1]. We call this adjusting the width of the processor. 
Reducing the width of the processor reduces the demand on the 
fetch, decode, functional units, and issue logic. Certain phases 
can have a high degree of instruction level parallelism, whereas 
other phases have a very low degree. Take for example the top 
two phases for gcc shown in Figure 7. The intervals classified 
to be in the first phase consisting of 18.5% of execution have an
IPC of 0.61 with a high data cache miss rate. In comparison, 
the intervals in the second most frequently encountered phase 
(accounting for 18.1% of execution) have an IPC of 1.95 and 
very low data cache miss rates. We can potentially save energy 
without hurting performance by throttling back the width of the 
processor for phases that have low IPC, while still using aggres- 
sive widths for phases with high IPC. 



In the current literature, decisions to reduce or increase the 
fetch/decode/issue bandwidth of the processor are made either 
at fixed intervals (relatively short intervals such as 1,000 cy- 
cles) [12] or, as in the case of branch confidence based schemes, 
when a branch instruction is fetched [1]. It can be very difficult to
design real systems that save energy by reconfiguring at these 
speeds, but a hardware phase-tracker can help make these deci- 
sions at a coarser granularity while still maintaining performance 
and energy benefits. 

We examined an architecture that could be configured with 2 
different widths - one where up to 2 instructions are decoded and 
up to 2 issued per cycle, and one where up to 8 instructions are 
decoded and up to 8 issued per cycle. When a new phase ID is 
seen by the phase tracker, we sample the IPC for three intervals 
with a width of 2 instructions, and three intervals with a width 
of 8 instructions. If there is little difference in the IPC between 
these two widths, then we assign a width of 2 instructions to this 
Phase ID in our profiling table, otherwise we assign a width of 
8 instructions. During execution, we use the phase ID predictor 
to effectively predict the width for the next interval of execution 
and adjust the processor's width accordingly. Our results show 
that the chosen configuration for a given phase can be trained 
(1) with only a few samples, and (2) only once to accurately 
represent the behavior of a given phase ID. This requires very 
little training time due to the fact that 20 or fewer phase IDs 
are needed to capture 80% or more of a program's execution as 
shown in Figure 5. 
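
The training step for width selection can be sketched in the same style. The text says only that a width of 2 is assigned when there is "little difference" in IPC, so the 5% tolerance used here is a hypothetical parameter, as are the function and table names.

```python
from statistics import mean

def choose_width(ipc_narrow, ipc_wide, tolerance=0.05):
    """Return 2 if the 2-wide IPC is within tolerance of the 8-wide IPC, else 8.

    ipc_narrow, ipc_wide: IPC samples from intervals run at each width.
    """
    avg_narrow = mean(ipc_narrow)
    avg_wide = mean(ipc_wide)
    relative_gain = (avg_wide - avg_narrow) / avg_wide
    return 2 if relative_gain <= tolerance else 8

# Hypothetical per-phase profiling table, trained once per new phase ID.
width_for_phase = {
    "low_ipc_phase": choose_width([0.60, 0.61, 0.60], [0.61, 0.62, 0.61]),
    "high_ipc_phase": choose_width([1.20, 1.20, 1.20], [1.95, 1.95, 1.95]),
}
print(width_for_phase)
```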

Figure 13 is a graph of the results seen when applying phase- 
directed width re-configuration. The white circles in the graph 
show the behavior of running the programs on only a 2-wide 
machine relative to the more aggressive 8-wide machine. The 
dotted line again shows what could potentially be achieved if
voltage scaling was used. While mcf and art save a lot of en- 
ergy with little performance degradation on a 2-wide machine, 
the other programs do not fare as well. The program apsi, for
example, has a slowdown of over 22% with an energy savings of 
around 30%. This does not compare favorably to voltage scal- 
ing (as discussed in Section 6.2). On the other hand if we use 
phase-directed width throttling on apsi, a total processor en- 
ergy savings of 18% can be achieved with only 2.2% slowdown. 

For all of the programs we examined, with one exception, 
the slowdown due to phase aware width throttling was less than 
4%, while the average energy savings was 19.6%. This result 
demonstrates that there is significant benefit to be had in the re- 
configuration of processor front end resources even at very large 
granularities. In the worst case, this will mean a re-configuration 
every 10 million instructions, and on average every 70 million 
instructions. This should be designable even under conservative 
assumptions. 

7 Summary 

In this paper we present an efficient run-time phase tracking ar- 
chitecture that is based on detecting changes in the code being 
executed. This is accomplished by dividing up all instructions 
seen into a set of buckets based on branch PCs. This way we ap- 
proximate the effect of taking a random projection of the basic 



block vector, which was shown in [21] to be an effective method 
of identifying phases in programs. 
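
The bucketing idea summarized here can be sketched in software. The counter count of 32 matches the configuration quoted in the Figure 8 caption; the hash from branch PC to bucket and the event format are assumptions made for illustration.

```python
NUM_COUNTERS = 32  # footprint counters, as in the Figure 8 configuration

def bucket_for(branch_pc):
    """Map a branch PC to a counter index (hypothetical hash)."""
    return (branch_pc >> 2) % NUM_COUNTERS

def build_footprint(branch_events):
    """Accumulate one interval's footprint.

    branch_events: iterable of (branch_pc, instructions_since_last_branch).
    Each counter sums the instructions attributed to the branches that
    hash to it.
    """
    footprint = [0] * NUM_COUNTERS
    for branch_pc, instruction_count in branch_events:
        footprint[bucket_for(branch_pc)] += instruction_count
    return footprint

events = [(0x400000, 5), (0x400004, 7), (0x400000, 3)]
footprint = build_footprint(events)
print(footprint[:2])
```

Footprints built this way for successive intervals are the vectors that the Manhattan distance comparison of Section 4.5.1 operates on.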

Using our phase classification architecture with less than 500 
bytes of on-chip memory, we show that for most programs, a sig- 
nificant amount of the program (over 80%) is covered by 20 or 
fewer distinct phases. Furthermore, we show that these phases,
while being distinct from one another, have fairly uniform be- 
havior within a phase, meaning that most optimizations applied 
to one phase will work well on all intervals in that phase. In the 
program gcc, the IPC attained by the processor on average over 
the full run of execution is 1.32, but has a standard deviation 
of more than 43%. By dividing it up into different phases, we 
achieve much more stable behavior, with IPCs ranging between 
0.61 and 1.95, but now with standard deviations of less than 2%. 
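
The comparison above between whole-run and per-phase variation can be reproduced with a short calculation. The sketch below assumes the percentages are standard deviations expressed relative to the mean IPC (a coefficient of variation); that reading, the function names, and the sample values in the usage note are assumptions for illustration, not data from the paper.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, Iterable, List, Tuple


def cov(values: List[float]) -> float:
    """Population standard deviation divided by the mean (0.0 if mean is 0)."""
    m = mean(values)
    return pstdev(values) / m if m else 0.0


def per_phase_cov(samples: Iterable[Tuple[int, float]]) -> Dict[int, float]:
    """Group (phase_id, ipc) samples by phase and return each phase's CoV."""
    by_phase: Dict[int, List[float]] = defaultdict(list)
    for phase_id, ipc in samples:
        by_phase[phase_id].append(ipc)
    return {phase_id: cov(ipcs) for phase_id, ipcs in by_phase.items()}
```

For a hypothetical trace such as `[(0, 0.60), (0, 0.62), (1, 1.94), (1, 1.96)]`, `cov` over all four IPC values is large because the two phases differ, while `per_phase_cov` reports a small value for each phase taken alone.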

In addition to this, we present a novel phase prediction archi- 
tecture using a Run Length Encoding Markov predictor that can 
predict not only when a phase change is about to occur, but also 
the phase ID to which it will transition. In using this design, which 
also uses less than 500 bytes of storage, we achieve a phase 
prediction miss rate of 10% for applu and 4% for apsi. In 
comparison, always predicting that the phase will stay the same 
results in a miss rate of 40% and 12% respectively. 
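
To make the contrast with "always predict the same phase" concrete, the following sketch keys a table on the pair (current phase ID, length of the current run) and stores the phase that followed. It is a simplified, unbounded software stand-in, assuming a dictionary in place of any fixed-size tagged table; the class and function names are hypothetical and the update policy is an assumption rather than the paper's exact design.

```python
from typing import Dict, List, Optional, Tuple


class RLEMarkovPredictor:
    """Predicts the next phase ID from (current phase, run length)."""

    def __init__(self) -> None:
        self.table: Dict[Tuple[int, int], int] = {}
        self.current: Optional[int] = None  # phase ID of the latest interval
        self.run = 0                        # consecutive intervals in that phase

    def predict(self) -> Optional[int]:
        # Fall back to "phase stays the same" when no entry matches.
        return self.table.get((self.current, self.run), self.current)

    def update(self, actual: int) -> None:
        if self.current is not None:
            key = (self.current, self.run)
            if actual != self.current:
                self.table[key] = actual      # remember what followed this run
            elif key in self.table:
                del self.table[key]           # stale entry predicted a change
        if actual == self.current:
            self.run += 1
        else:
            self.current, self.run = actual, 1


def rle_markov_misses(trace: List[int]) -> int:
    """Count mispredictions of the sketch over a sequence of phase IDs."""
    predictor = RLEMarkovPredictor()
    misses = 0
    for actual in trace:
        if predictor.current is not None and predictor.predict() != actual:
            misses += 1
        predictor.update(actual)
    return misses


def last_phase_misses(trace: List[int]) -> int:
    """Count mispredictions when always predicting the previous phase ID."""
    return sum(1 for prev, cur in zip(trace, trace[1:]) if prev != cur)
```

On a strictly periodic trace such as `[0, 0, 0, 1, 1] * 4`, the run-length table misses only while it first learns each transition, whereas the last-phase baseline misses at every phase change.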

We also examined using our phase tracking and prediction 
architecture to enable new phase-directed optimizations. Tra- 
ditional architecture and software optimizations are targeted at 
the average or aggregate behavior of a program. In comparison, 
phase-directed optimizations aim at optimizing a program's per- 
formance tailored to the different phases in a program. In this pa- 
per, we examined using phase tracking and prediction to increase 
frequent value profiling coverage, and to provide energy savings 
through data cache and processor width re-configuration. 

We believe our phase tracking and prediction design will 
open the door for a new class of run-time optimization that tar- 
gets large scale program behavior. Even though we present a 
hardware implementation for phase tracking, a similar design 
can be implemented in software to perform phase classification 
for run-time optimizers, just-in-time compilation systems, and 
operating systems. Hardware and software optimizations that 
can potentially benefit the most from phase classification and 
prediction are (1) those that need expensive profiling/training 
before applying an optimization, (2) those where performing the 
optimization is itself either slow or expensive, 
and (3) those that can benefit from specialization where 
they have the same code/data being used differently in different 
phases of execution. By using our dynamic phase tracking and 
prediction design, phase-behavior can be characterized and pre- 
dicted at the largest of scales, providing a unified mechanism for 
phase-directed optimization. 

Acknowledgments 

We would like to thank Jeremy Lau and the anonymous review- 
ers for providing useful comments on this paper. This work 
was funded in part by NSF CAREER grant No. CCR-9733278, 
Semiconductor Research Corporation grant No. SRC-2001-HJ- 
897, and an equipment grant from Intel. 



References 

[1] J.L. Aragon, J. Gonzalez, and A. Gonzalez. Power-aware control speculation 
through selective throttling. In Proceedings of the Ninth International 
Symposium on High-Performance Computer Architecture, February 2003. 

[2] R. Balasubramonian, D. H. Albonesi, A. Buyuktosunoglu, and S. Dwarkadas. 
Memory hierarchy reconfiguration for energy and performance in 
general-purpose processor architectures. In 33rd International Symposium 
on Microarchitecture, pages 245-257, 2000. 

[3] R. D. Barnes, E. M. Nystrom, M. C. Merten, and W. W. Hwu. Vacuum 
packing: Extracting hardware-detected program phases for post-link 
optimization. In 35th International Symposium on Microarchitecture, 
December 2002. 

[4] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: a framework for 
architectural-level power analysis and optimizations. In 27th Annual 
International Symposium on Computer Architecture, pages 83-94, June 2000. 

[5] D. C. Burger and T. M. Austin. The SimpleScalar tool set, version 2.0. 
Technical Report CS-TR-97-1342, U. of Wisconsin, Madison, June 1997. 

[6] B. Calder, P. Feller, and A. Eustace. Value profiling and optimization. 
Journal of Instruction Level Parallelism, March 1999. 

[7] B. Calder, G. Reinman, and D.M. Tullsen. Selective value prediction. In 
26th Annual International Symposium on Computer Architecture, pages 64-74, 
June 1999. 

[8] I.-C. Chen, J. T. Coffey, and T. N. Mudge. Analysis of branch prediction 
via data compression. In Seventh International Conference on Architectural 
Support for Programming Languages and Operating Systems, pages 128-137, 
October 1996. 

[9] A. Dhodapkar and J. E. Smith. Dynamic microarchitecture adaptation via 
co-designed virtual machines. In International Solid State Circuits 
Conference, February 2002. 

[10] A. Dhodapkar and J. E. Smith. Managing multi-configuration hardware via 
dynamic working set analysis. In 29th Annual International Symposium on 
Computer Architecture, May 2002. 

[11] M. Huang, J. Renau, and J. Torrellas. Profile-based energy reduction in 
high-performance processors. In 4th Workshop on Feedback-Directed and 
Dynamic Optimization (FDDO-4), December 2001. 

[12] A. Iyer and D. Marculescu. Power aware microarchitecture resource scaling. 
In Proceedings of the DATE 2001 on Design, automation and test in Europe, 
pages 190-196, 2001. 

[13] D. Joseph and D. Grunwald. Prefetching using markov predictors. In 24th 
Annual International Symposium on Computer Architecture, June 1997. 

[14] M.H. Lipasti, C.B. Wilkerson, and J.P. Shen. Value locality and load value 
prediction. In Seventh International Conference on Architectural Support 
for Programming Languages and Operating Systems, pages 138-147, 
October 1996. 

[15] M. Merten, A. Trick, R. Barnes, E. Nystrom, C. George, J. Gyllenhaal, and 
Wen-mei W. Hwu. An architectural framework for run-time optimization. 
IEEE Transactions on Computers, 50(6):567-589, June 2001. 

[16] M. Mock, C. Chambers, and S.J. Eggers. Calpa: a tool for automating 
selective dynamic compilation. In 33rd International Symposium on 
Microarchitecture, pages 291-302, December 2000. 

[17] R. Muth, S.A. Watterson, and S.K. Debray. Code specialization based on 
value profiles. In Static Analysis Symposium, pages 340-359, 2000. 

[18] P. Ranganathan, S. V. Adve, and N.P. Jouppi. Reconfigurable caches and 
their application to media processing. In 27th Annual International 
Symposium on Computer Architecture, pages 214-224, June 2000. 

[19] T. Sherwood and B. Calder. Time varying behavior of programs. Technical 
Report UCSD-CS99-630, UC San Diego, August 1999. 

[20] T. Sherwood, E. Perelman, and B. Calder. Basic block distribution analysis 
to find periodic behavior and simulation points in applications. In 
International Conference on Parallel Architectures and Compilation Techniques, 
September 2001. 

[21] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder. Automatically 
characterizing large scale program behavior. In Proceedings of the 10th 
International Conference on Architectural Support for Programming Languages 
and Operating Systems, October 2002. 

[22] J. Yang and R. Gupta. Frequent value locality and its applications. Spe- 
cial Issue on Memory Systems, ACM Transactions on Embedded Computing 
Systems, 1(1):79-105, November 2002. 



Intel/P16136 
US Application Serial No. 10/424,356 



EXHIBIT 9 
TO DECLARATION OF 
JAMES A FLIGHT 



ports Listed by Year http://72.14.209.104/searchVq=cache:Jlm5~uHXURsJ:www-cse.ucsd.. 



d.edu/Dienst/UI/2.0/ListYears/1999-2005?authority= as retrieved on Jan 9, 2007 21:51:18 GMT. 
that we took of the page as we crawled the web. 

that time. Click here for the current page without highlighting. 

This cached page may reference images which are no longer available. Click here for the cached text only. 
To link to or bookmark this page, use the following url: 

Google is neither affiliated with the authors of this page nor responsible for its content. 

These search terms have been highlighted: technical report 



Reports Listed by Year 

• 1999 

o On the Resilience of Broadcasting Strategies in a Failure-Propagating Environment. CS1999-0610, Meng-Jang Lin, Aleta M Ricciardi and Keith Marzullo; July 6, 1999 
o Selecting tile shape for minimal execution time. CS1999-0616, Karin Hogstedt, Larry Carter and Jeanne Ferrante; May 20, 1999 
o Determining the idle time of a tiling. CS1999-0617, Karin Hogstedt, Larry Carter and Jeanne Ferrante; July 6, 1999 
o Adaptive Performance Prediction for Distributed Data-intensive Applications. CS1999-0619, Marcio Faerman, Alan Su, Richard Wolski and Francine Berman; May 18, 1999 
o Directional Gossip: Gossip in a Wide Area Network. CS1999-0622, Meng-Jang Lin and Keith Marzullo; June 16, 1999 
o A Client-Server Oriented Algorithm for Virtually Synchronous Group Membership in WANs. CS1999-0623, Idit Keidar, Jeremy Sussman, Keith Marzullo and Danny Dolev; July 7, 1999 
o Minimax Programs and Bitonic Column Matrices. CS1999-0624, Paul A. Tucker and T. C. Hu; June 17, 1999 
o Min Cuts Without Path Packing. CS1999-0625, Paul A Tucker, T. C. Hu and M. T. Shing; June 17, 1999 
o PCA = Gabor for Expression Recognition. CS1999-0629, Matthew N. Dailey and Garrison W. Cottrell; October 26, 1999 
o Application Scheduling over Supercomputers: A Proposal. CS1999-0631, Walfredo Cirne and Fran Berman; October 7, 1999 
o Heuristics for Scheduling Parameter Sweep Applications in Grid Environments. CS1999-0632, Henri Casanova, Arnaud Legrand, Dmitrii Zagorodnov and Fran Berman; October 14, 1999 
o Proofs on Safety for Untrusted Code. CS1999-0633, Grigore Rosu and Nathan Segerlind; October 27, 1999 
o Optimistic Virtual Synchrony. CS1999-0634, Jeremy Sussman, Idit Keidar and Keith Marzullo; November 9, 1999 
o Gossip versus Deterministic Flooding: Low Message Overhead and High Reliability for Broadcasting on Small Networks. CS1999-0637, Meng-Jang Lin, Keith Marzullo and Stefano Masini; November 18, 1999 
o Agent Usage Patterns: Bridging the Gap Between Agent-Based Applications and Middleware. CS1999-0638, Eugene Hung and Joseph Pasquale; November 19, 1999 
o AspectBrowser: Tool Support for Managing Dispersed Aspects. CS1999-0640, W. G. Griswold, Y. Kato and J. J. Yuan; January 3, 2000 
• 2000 

o Limited Mobile Agents: A Practical Approach. CS2000-0641, Jesse M. Steinberg and Joseph Pasquale; December 29, 1999 
o Combining Workstations and Supercomputers to Support Grid Applications: The Parallel Tomography Experience. CS2000-0642, Shava Smallen, Walfredo Cirne, Jaime Frey, Fran Berman, Rich Wolski, Mei-Hui Su, Carl Kesselman, Steve Young and Mark Ellisman; January 7, 2000 
o Application Scheduling on the Information Power Grid. CS2000-0644, Dmitrii Zagorodnov, Francine Berman and Rich Wolski; January 11, 2000 
o Encode-then-encipher encryption: How to exploit nonces or redundancy in plaintexts for efficient cryptography. CS2000-0646, Mihir Bellare and Phillip Rogaway; March 6, 2000 
o Circular Coinduction. CS2000-0647, Grigore Rosu and Joseph Goguen; March 14, 2000 
o Reducing the Overhead of Compilation Delay. CS2000-0648, Chandra Krintz, David Grove, Derek Lieber, Vivek Sarkar and Brad Calder; March 27, 2000 
o Reducing DRAM Power Using Compiler Assisted Refreshing. CS2000-0649, Timothy Sherwood and Brad Calder; April 21, 2000 
o Dynamic Selection of Compression Formats to Reduce Transfer Delay. CS2000-0650, Chandra Krintz and Brad Calder; April 21, 2000 
o Scalable Causal Message Logging for Wide-Area Networks. CS2000-0651, Karan Bhatia, Keith Marzullo and Lorenzo Alvisi; April 21, 2000 
o Abstract Semantics for Module Composition. CS2000-0653, Grigore Rosu; May 8, 2000 
o Plane Cover Multiple Access: A New Approach to Maximizing Cellular System Capacity. CS2000-0654, Paul Blair and George Polyzos; May 28, 2000 
o Multi-Language Support in a Program Analysis and Visualization Tool. CS2000-0655, Stuart Moskovics; June 20, 2000 
o Design Automation for Finite State Machine Predictors. CS2000-0656, Timothy Sherwood and Brad Calder; June 28, 2000 
o A Power Efficient Speculative Fetch Architecture. CS2000-0657, Glenn Reinman, Brad Calder and Todd Austin; June 28, 2000 
o Uniform Hashing with Multiple Passbits. CS2000-0658, Paul Martini and Walter Burkhard; August 18, 2000 
o Teaching Software Engineering in a Compiler Project Course. CS2000-0659, William G. Griswold; September 12, 2000 
o Exploiting the Map Metaphor in a Tool for Software Evolution. CS2000-0660, William G. Griswold, Jimmy J. Yuan and Yoshikiyo Kato; September 20, 2000 
o Hurwitz Interconnect Delay Evaluation - HIDE: User's Manual. CS2000-0661, Xiao-Dong Yang, Zhanhai Qin and Chung-Kuan Cheng; November 9, 2000 
o Hurwitz Interconnect Delay Evaluation - HIDE: Programmer's Manual. CS2000-0662, Zhanhai Qin and Chung-Kuan Cheng; November 9, 2000 
o Using Annotations to Reduce Dynamic Optimization Time. CS2000-0663, Chandra Krintz and Brad Calder; November 16, 2000 
• 2001 

o Learning and Making Decisions When Costs and Probabilities are Both Unknown. CS2001-0664, Bianca Zadrozny and Charles Elkan; January 2, 2001 
o Implementation Techniques for Efficient Data-Flow Analysis of Large Programs. CS2001-0665, Darren C. Atkinson and William G. Griswold; February 3, 2001 
o Scalable Measurement: Finding some Elephants in a Swarm of Ants. CS2001-0666, Cristian Estan and George Varghese; February 12, 2001 
o Basic Block Distribution Analysis to Find Periodic Behavior and Simulation Points in Applications. CS2001-0667, Timothy Sherwood, Erez Perelman and Brad Calder; March 18, 2001 
o A Comparative Study of Two Whole Program Slicers for C. CS2001-0668, Leeann Bent, Darren C. Atkinson and William G. Griswold; April 12, 2001 
o Docking topical hierarchies: A comparison of two algorithms for reconciling keyword structures. CS2001-0669, Bryan Tower, Mark Chaisson and Richard Belew; April 26, 2001 
o Fast Content-Based Packet Handling for Intrusion Detection. CS2001-0670, Mike Fisk and George Varghese; May 7, 2001 
o Scalable Causal Message Logging for Wide-Area Environments. CS2001-0671, Karan Bhatia, Keith Marzullo and Lorenzo Alvisi; May 24, 2001 
o Reducing Load Delay to Improve Performance of Internet-Computing Programs. CS2001-0672, Chandra Krintz; May 25, 2001 
o Aggregated Bit Vector Search Algorithms for Packet Filter Lookups. CS2001-0673, Florin Baboescu and George Varghese; June 3, 2001 
o Proxy Caching with Hash Functions. CS2001-0674, Florin Baboescu; June 3, 2001 
o On-line Parallel Tomography. CS2001-0675, Shava Smallen; June 5, 2001 
o Hardware Optimizations Enabled by a Decoupled Fetch Architecture. CS2001-0676, Glenn Reinman; June 22, 2001 
o Predicting Region Branches Using Predicate Update Branch Prediction. CS2001-0677, Beth Simon, Brad Calder and Jeanne Ferrante; June 25, 2001 
" " - 1e Instruction ROM Architecture. CS2001-0678, Timothy Sherwood and Brad Calder; June 25, 2001 



3/1/07 2:08 PM 




o Rotational Position Optimization (RPO) Disk Scheduling. CS2001-0679, Walter Burkhard and John Palmer; July 16, 2001 
o Improved Linux File System Hashing. CS2001-0680, Ying Chen, Walter Burkhard and John Palmer; July 16, 2001 
o Minimum-Buffered Routing Of Non-Critical Nets. CS2001-0681, Andrew Kahng Charles Alpert Bao Liu Ion Mandoiu Alexander Zelikovsky; August 14, 2001 
o Reengineering Cocoon with AspectJ. CS2001-0682, Leeann Bent; September 4, 2001 
o Automatically Downloading Images to Improve Web Transfer Times. CS2001-0683, Girish Chandranmenon and George Varghese; September 11, 2001 
o FTP-M: An FTP-like Multicast File Transfer Application. CS2001-0684, Manamohan Mysore and George Varghese; September 11, 2001 
o Modifying Shortest Path Routing Protocols to Create Symmetrical Routes. CS2001-0685, Rajib Ghosh and George Varghese; September 11, 2001 
o Bounded-Depth Frege with Counting Principles Polynomially Simulates Nullstellensatz Refutations. CS2001-0686, Russell Impagliazzo and Nathan Segerlind; November 14, 2001 
o Guessing Two Secrets with Small Queries. CS2001-0687, Daniele Micciancio and Nathan Segerlind; November 14, 2001 
o Efficient Design Space Exploration for Customized Processors. CS2001-0688, Timothy Sherwood, Mark Oskin and Brad Calder; November 20, 2001 
o Relieving Register File and Instruction Window Pressure. CS2001-0689, Glenn Reinman, Brad Calder and Todd Austin; November 20, 2001 
o Proof of Correctness for Sparse Tiling of Gauss-Seidel. CS2001-0690, Michelle Mills Strout, Larry Carter and Jeanne Ferrante; December 4, 2001 
o Dynamic Web Stream Customizers. CS2001-0691, Jesse Steinberg and Joseph Pasquale; December 14, 2001 
o A Web Middleware Architecture for Dynamic Customization of Web Content for Non-Traditional Clients. CS2001-0692, Jesse Steinberg and Joseph Pasquale; December 14, 2001 
o Agent Behavior Patterns in a Wireless Internet Environment. CS2001-0693, Eugene Hung and Joseph Pasquale; December 17, 2001 
o A Decoupled Predictor-Directed Stream Prefetching Architecture. CS2001-0694, Suleyman Sair, Timothy Sherwood and Brad Calder; December 17, 2001 
o Turning Predicate Information to Advantage to Improve Compiler Scheduling and Branch Prediction. CS2001-0695, Beth Simon; December 27, 2001 

• 2002 

o Sources of Success for Information Extraction Methods. CS2002-0696, David Kauchak, Joseph Smarr and Charles Elkan; January 7, 2002 
o Exponential Separation of Res(k) and Res(k+1). CS2002-0697, Sam Buss, Russell Impagliazzo and Nathan Segerlind; January 11, 2002 
o A Modular Framework for Adaptive Scheduling in Grid Application Development Environments. CS2002-0698, Holly Dail; January 18, 2002 
o New directions in traffic measurement and accounting. CS2002-0699, Cristian Estan and George Varghese; February 8, 2002 
o Compiler and Hardware Predicated Dependency Analysis and Scheduling. CS2002-0700, Lorinda Carter; March 18, 2002 
o Automatically Characterizing Large Scale Program Behavior. CS2002-0701, Timothy Sherwood, Erez Perelman and Brad Calder; March 18, 2002 
o Alternatives to the k-means algorithm that find better clusterings. CS2002-0702, Greg Hamerly and Charles Elkan; April 3, 2002 
o The ActiveClass Project: Experiments in Encouraging Classroom Participation. CS2002-0703, Tan Minh Truong, William G. Griswold, Matthew Ratto and Leigh Star; April 24, 2002 
o ActiveCampus - Sustaining Educational Communities through Mobile Technology. CS2002-0704, William G. Griswold, Robert Boyer, Steven W. Brown, Tan Minh Truong, Ezekiel Bhasker, Gregory R. Jay and R. Benjamin Shapiro; April 24, 2002 
o Counting the number of active flows on a high speed link. CS2002-0705, Cristian Estan, George Varghese and Mike Fisk; May 21, 2002 
o Linear Network Reduction Via generalized Y-$\Delta$ Transformation: Theory. CS2002-0706, Zhanhai Qin and Chung-Kuan Cheng; May 22, 2002 
o The Virtual Instrument Support for Grid-enabled Scientific Simulations. CS2002-0707, Henri Casanova, Thomas Bartol, Francine Berman, Adam and Dongarra Birnbaum Jack, Mark Ellisman, Marcio Faerman, Ethan Goekay, Michelle Miller, Graziano Obertelli, Stuart Pomerantz, Terry and Stiles Sejnowski Joel and Rich Wolski; May 31, 2002 
o A Modular Scheduling Approach for Grid Application Development Environments. CS2002-0708, Holly Dail, Henri Casanova and Fran Berman; June 5, 2002 
o JBIG Compression Algorithms for "Dummy Fill" VLSI Layout Data. CS2002-0709, Robert Ellis, Andrew Kahng and Yuhong Zheng; June 14, 2002 
o Phase Tracking and Prediction. CS2002-0710, Timothy Sherwood, Suleyman Sair and Brad Calder; June 23, 2002 

o Optimized Trace Binaries for Architectural Evaluation. CS2002-0711, Suleyman Sair, Yuanfang Hu, Timothy Sherwood and Brad Calder; June 23, 2002 
o Pointer Cache Assisted Speculative Precomputation. CS2002-0712, Jamison Collins, Suleyman Sair, Brad Calder and Dean Tullsen; June 23, 2002 
o ActiveCampus - Sustaining Educational Communities through Mobile Technology. CS2002-0714, William G. Griswold, Robert Boyer, Steven W. Brown, Tan Minh Truong, Ezekiel Bhasker, Gregory R. Jay and R. Benjamin Shapiro; July 8, 2002 
o The ActiveClass Project: Experiments in Encouraging Classroom Participation. CS2002-0715, Tan Minh Truong, William G. Griswold, Matthew Ratto and Susan Leigh Star; July 8, 2002 
o Learning the k in k-means. CS2002-0716, Greg Hamerly and Charles Elkan; July 30, 2002 
o The Y-architecture: Yet Another On-Chip Interconnect Solution. CS2002-0717, Hongyu Chen, Feng Zhou and Chung-Kuan Cheng; August 7, 2002 
o Fast and Scalable Conflict Detection for Packet Classifiers. CS2002-0718, Florin Baboescu and George Varghese; August 7, 2002 
o Packet Classification for Core Routers: Is there an alternative to CAMs?. CS2002-0719, Florin Baboescu, Sumeet Singh and George Varghese; August 7, 2002 
o Resource Allocation for Steerable Parallel Parameter Searches: an Experimental Study. CS2002-0720, Marcio Faerman, Adam Birnbaum, Henri Casanova and Fran Berman; August 18, 2002 
o A Multi-Round Algorithm for Scheduling Divisible Workload Applications: Analysis and Experimental Evaluation. CS2002-0721, Yang Yang and Henri Casanova; September 26, 2002 
o Synchronous Consensus for Dependent Process Failures. CS2002-0722, Flavio Junqueira and Keith Marzullo; October 3, 2002 
o Coping with Dependent Process Failures. CS2002-0723, Flavio Junqueira, Keith Marzullo and M. Voelker Geoffrey; October 7, 2002 
o Using Mobile Technology to Create Opportunistic Interactions on a University Campus. CS2002-0724, William G. Griswold, Robert Boyer, Steven W. Brown, Tan Minh Truong, Ezekiel Bhasker, Gregory R. Jay and R. Benjamin Shapiro; October 16, 2002 
o Group Membership and Wide-Area Master-Worker Computations. CS2002-0725, Kjetil Jacobsen, Xianan Zhang and Keith Marzullo; November 6, 2002 
o Replication Strategies for Highly Available Peer-to-peer Storage Systems. CS2002-0726, Ranjita Bhagwan, Stefan Savage and Geoffrey M. Voelker; November 6, 2002 
o Using SimPoints in Diverse Simulation Environments. CS2002-0727, Erez Perelman, Michael Van Biesbrouck, Timothy Sherwood and Brad Calder; November 16, 2002 
o Interaction of Virtual Machine with the Operating System. CS2002-0728, Kiran Tati and Geoffrey M. Voelker; December 2, 2002 
o Whole Page Performance. CS2002-0729, Leeann Bent and Geoffrey M. Voelker; December 16, 2002 
o HYPERCUTS: A Decision Tree Based Algorithm for Fast Packet Classification. CS2002-0730, Sumeet Singh, George Varghese and Florin Baboescu; December 12, 2002 
o Security in the Sanctuary System. CS2002-0731, Matthew Hohlfeld, Aditya Ojha and Bennet Yee; December 20, 2002 

• 2003 

o The Phoenix Recovery System: Rebuilding from the ashes of an Internet catastrophe. CS2003-0732, Flavio Junqueira, Ranjita Bhagwan, Keith Marzullo, Stefan Savage and Geoffrey M. Voelker; January 13, 2003 
o Connectivity in the South American Internet. CS2003-0733, Flavio Junqueira and Renata Teixeira; January 13, 2003 
o Lower Bound on the Number of Rounds for Consensus with Dependent Process Failures. CS2003-0734, Flavio Junqueira and Keith Marzullo; January 13, 2003 
o MPI Process Swapping: Architecture and Experimental Verification. CS2003-0735, Otto Sievert and Henri Casanova; January 29, 2003 
o Packet Classification Using Multidimensional Cutting. CS2003-0736, Sumeet Singh, Florin Baboescu, George Varghese and Jia Wang; February 7, 2003 
o Consensus for Dependent Process Failures. CS2003-0737, Flavio Junqueira and Keith Marzullo; February 18, 2003 
o Bitmap algorithms for counting active flows on high speed links. CS2003-0738, Cristian Estan, George Varghese and Mike Fisk; March 13, 2003 
o Query Set Specification Language (QSSL). CS2003-0739, Michalis Petropoulos, Alin Deutsch and Yannis Papakonstantinou; March 24, 2003 
o Increasing Object Visibility In Decentralized Unstructured Peer-To-Peer Networks Using Content Based Routing. CS2003-0740, Sumeet Singh and Florin Baboescu; March 28, 2003 
o Proof of Correctness for Sparse Tiling of Gauss-Seidel. CS2003-0741, Michelle Mills Strout, Larry Carter and Jeanne Ferrante; April 1, 2003 

o The case for ISP deployment of super-peers in P2P networks. CS2003-0742, Sumeet Singh, Sriram Ramabhadran, Florin Baboescu and Alex Snoeren; April 15, 2003 

o On the Generalization of n>k't. CS2003-0743, Flavio Junqueira and Keith Marzullo; April 21, 2003 
o A Flow-based Task Scheduling Strategy for Distributed Systems. CS2003-0744, Sagnik Nandy, Jeanne Ferrante and Larry Carter; May 2, 2003 
o Real-time Detection of Known and Unknown Worms. CS2003-0745, Sumeet Singh, Cristian Estan, George Varghese and Stefan Savage; May 30, 2003 
o Automatically Inferring Patterns of Resource Consumption in Network Traffic. CS2003-0746, Cristian Estan, Stefan Savage and George Varghese; June 2, 2003 
o The Measurement Manifesto. CS2003-0747, George Varghese and Cristian Estan; June 4, 2003 
o Online Load Balancing and First-Hop Bandwidth Allocation in Public-Area Wireless Networks. CS2003-0748, Anand Balachandran, Sagnik Nandy, Venkat P. Rangan and Geoffrey M. Voelker; June 10, 2003 
o The Impact of Address Allocation and Routing on the Structure and Implementation of Routing Tables. CS2003-0749, Harsha Narayan, Ramesh Govindan and George Varghese; June 19, 2003 
o ActiveCampus - Experiments in Community-Oriented Ubiquitous Computing. CS2003-0750, William G. Griswold, Patricia Shanahan, Steven W. Brown, Robert Boyer, Matt Ratto, R. Benjamin Shapiro and Tan Minh Truong; June 24, 2003 
o Bucking Free-Riders: Distributed Accounting and Settlement in Peer-to-Peer Networks. CS2003-0751, Abhishek Agrawal, Douglas Brown, Aditya Ojha and Stefan Savage; June 24, 2003 
o Application-Tuned Processor Architectures. CS2003-0752, Timothy Sherwood; June 25, 2003 
o Predictor-Directed Data Prefetching for Pointer-based Applications. CS2003-0753, Suleyman Sair; June 25, 2003 
o Extensions to the Multi-Installment Algorithm: Affine Costs and Output Data Transfers. CS2003-0754, Yang Yang and Henri Casanova; July 16, 2003 
o Cone: Augmenting DHTs to Support Distributed Resource Discovery. CS2003-0755, Ranjita Bhagwan, George Varghese and Geoffrey M. Voelker; July 21, 2003 
o A Multiple Level Network Approach for Clock Skew Minimization with Process Variations. CS2003-0756, Makoto Mori, Hongyu Chen, Bo Yao and Chung-Kuan Cheng; July 28, 2003 
o Structures and Algorithms for Phase Classification. CS2003-0757, Jeremy Lau, Stefan Schoenmackers and Brad Calder; July 29, 2003 

o GRYD: Generalized Reduced-Order Wye-Delta Transformation: Programmer's Manual for Reduction Engine and Applications. CS2003-0758, Zhanhai Qin and Chung-Kuan Cheng; July 31, 2003 
o GRYD: Generalized Reduced-Order Wye-Delta Transformation: User's Manual for Reduction Engine and Applications. CS2003-0759, Zhanhai Qin and Chung-Kuan Cheng; July 31, 2003 
o Benchmark Probes for Grid Assessment. CS2003-0760, Greg Chun, Holly Dail, Henri Casanova and Allan Snavely; August 1, 2003 
o The EarlyBird System for Real-time Detection of Unknown Worms. CS2003-0761, Sumeet Singh, Cristian Estan, George Varghese and Stefan Savage; August 4, 2003 
o Segmentation by Example. CS2003-0762, Sameer Agarwal and Serge Belongie; August 14, 2003 
o Three Brown Mice: See How They Run. CS2003-0763, Kristin Branson, Vincent Rabaud and Serge Belongie; August 19, 2003 
o Approximation Methods for Thin Plate Spline Mappings and Principal Warps. CS2003-0764, Gianluca Donato and Serge Belongie; September 4, 2003 
o Employing User Feedback for Fast, Accurate, Low-Maintenance Geolocationing. CS2003-0765, Ezekiel S. Bhasker, Steven W. Brown and William G. Griswold; September 5, 2003 
o An Adaptive System for Real-time Summaries of Internet Traffic. CS2003-0766, Cristian Estan, Ken Keys and David Moore; September 24, 2003 
o Structure from Periodic Motion. CS2003-0767, Serge Belongie and Josh Wills; October 10, 2003 
o A Feature-based Approach for Determining Dense Long Range Correspondences. CS2003-0768, Josh Wills and Serge Belongie; October 20, 2003 
o Characterizing and Evaluating Desktop Grids: An Empirical Study. CS2003-0769, Derrick Kondo, Michela Taufer, John Karanicolas, Charles L. Brooks, Henri Casanova and Andrew Chien; October 22, 2003 
o DGMonitor: a Performance Monitoring Tool for Sandbox-based Desktop Grid Platforms. CS2003-0770, Pietro Cicotti, Michela Taufer and Andrew Chien; October 24, 2003 
o A Co-Phase Matrix to Guide Simultaneous Multithreading Simulation. CS2003-0771, Michael Van Biesbrouck, Timothy Sherwood and Brad Calder; October 28, 2003 
o Structures for Phase Classification. CS2003-0772, Jeremy Lau, Stefan Schoenmackers and Brad Calder; October 28, 2003 
o The Entropia Virtual Machine for Desktop Grids. CS2003-0773, Brad Calder, Andrew Chien, Ju Wang and Don Yang; October 28, 2003 
o Code Pointer Protection From Buffer Overflow Through Targeted Hardware Encryption. CS2003-0774, Nathan Tuck, Brad Calder and George Varghese; December 1, 2003 

■ 2O04 

° One Dimensional Knapsack. CS2004-077S. T. C. Hu. M. T. ShinK and Leo Landn; January 14. 2004 

o Using Network How Buffering to Improve Performance of Video over HTTP. CS2004-0776. Jesse Steinberg and Joseph Posquale; January 14, 2004 

o A Nccr-Oplimnl Algorithm for a Localilv-Mnximizlnu, Placement Problem. CS2004-0777, Fan Chung, Ronald Graham, Ranjita Bhagwan, Stefan Savage and 

GeofTrey M. Voelker, January 16,2004 
o Tde-Rcalltv farllie Rest of Us. CS2004-0778. Neil McCurdy and William Griswold; January 16, 2004 

a Critical Poinls for Interactive Schema Matching. CS2004-0779, Guilian Wang, Joseph Goguen, Young-Kwong Nam and Kal Lin; January 30, 2004 

□ Accuss ond Mobil ily of Wireless PDA Users. CS20O4-0780, Mnrvin McNeil ond GeofTrey M. Voelker. February 9, 2004 

a Buildinn a Hierarchy of Variable LcnKih Intervals to Capture Hierarchical Phase Behavior. CS2004-078 1, Jeremy Lnu, Ercz Perclman, Greg Hamerty, Timothy 

Sherwood and Brad Calder, March 13, 2004 
o Coiict A Distributed Henri Approach to Resource Selection. CS2004-O7B2, Ranjita Bhagwan, Priyo Mahndevan, George Varghese and Geoffrey M. Voelker; 

March 22. 2004 

° Ontimizinu the Knapsack Problem. CS2OO4-0783. Leo Lnnda; April 2, 2004 

o Comparison belweun multistage fillers and sketches for finding heavy hitlers. C5ZO04-O784, Cristian Estnn; April 27, 2004 

o APST-DV: Divisible Load Scheduling, and Deployment on the Grid. CS2004-O78S. Krijn van dcr Rnadt, Yang Yang and Henri Casanova; April 28. 2004 
o OptlPutcr System Software Framework. CS2004-O786, Xinran (RyanJ Wu Andrew A. Chien Nut Tnesombut Eric Weigic Huaxia Xia and Justin Burke; April 
28, 2004 

o Sync-scan: A fast hand-off procedure for 802.1 1 link layer roam inn. CS2004-0787, ishwar ramani and Stefan savage; May 3, 2004 
o Undcislnnilinti When Localion-Hidint; Using Overlay Networks Is Feasible. CS2Q04-O788, Ju Wang and Andrew Chien; May 9, 2004 
o Peteelinn Malicious Routers. CS2004-O789, Alper Mizrak. Keith Marzullo and Stefan Savage; May 24, 2004 

o Scwi-paramctric (apanenliol family PCA : Rcdiicimt dimensions via ncin-nnramelric latent distribution estimation. CS20O4-O79O, Sajnma Sajama and Alon 
Orlitsky; June 2, 20D4 

d Fulcrum - An Open-Implementation Approach to Context-Aware Publish / Subscribe. CS2004-0791. Roberl T. Boyernnd William G. Griswold; June 8, 2004 
o MobiNct: A Scalable Emulation Infrastructure for Ad Hoc and Wireless Networks. CS2004-0792. Priyn Mahadcvan, Adolfo Rodriguei. David Becker and 

Amm Vahdat; June 14, 2004 
d Unified Summaries fiir Internet traffic. CS2004-0793. Crislian Estan; June 15, 2004 
a Snce Algorithms for Knapsack Problem. C52004-0794. Leo Landa; June IS, 2004 

o Network Telescopes: Technical Report. CS2004-0795, David Moore, Colleen Shannon, Geoffrey M. Voelker and Stefan Savage; July 7, 2004
o Computing the Optimal Makespan for Jobs with Identical and Independent Tasks Scheduled on Volatile Hosts. CS2004-0796, Derrick Kondo and Henri Casanova; July 12, 2004

o Using Program Phases as Meta-Data for Runtime Energy Optimization. CS2004-0797, Cristiano Pereira and Rajesh Gupta; July 14, 2004

o Evaluation of a High Performance Erasure Code Implementation. CS2004-0798, Frank Uyeda, Huaxia Xia and Andrew A. Chien; September 13, 2004

o A New Direction in Tree Based Search Engine Architectures Using Balanced Single Port Memories. CS2004-0799, Florin Baboescu and Dean Tullsen; October 15, 2004

o Declarative Resource Naming for Macroprogramming Wireless Networks of Embedded Systems. CS2004-0800, Chalermek Intanagonwiwat, Rajesh Gupta and Amin Vahdat; November 2, 2004

o A Placement Methodology for Global Interconnect Reduction and Its Impact on Performance. CS2004-0801, Andrew Kahng, Igor Markov and Sherief Reda; October 31, 2004

o APST-DV: A Practical Framework for Scheduling and Deploying Divisible Loads on Grid Platforms. CS2004-0802, Krijn van der Raadt, Yang Yang and Henri Casanova; November 9, 2004

o Efficient Sampling Startup for Uniprocessor and Simultaneous Multithreading Simulation. CS2004-0803, Michael Van Biesbrouck, Lieven Eeckhout and Brad Calder; November 28, 2004

o Selecting Software Phase Markers with Code Structure Analysis. CS2004-0804, Jeremy Lau, Erez Perelman and Brad Calder; November 28, 2004

o Efficient Bounds Checking for C. CS2004-0805, Weihaw Chuang, Satish Narayanasamy and Brad Calder; November 28, 2004

o CLIDE: Interactively Formulating Feasible Queries on Query Rewriting-Based Systems. CS2004-0807, Michalis Petropoulos, Alin Deutsch and Yannis Papakonstantinou; December 12, 2004
o Critical-Path Aware Processor Architectures. CS2004-0808, Eric Tune; December 16, 2004

o Efficient Resource Description and High Quality Selection for Virtual Grids. CS2004-0809, Yang-Suk Kee, Dionysios Logothetis, Richard Huang, Henri



o Limit results on pattern entropy. CS2004-0811, Alon Orlitsky, Narayana Santhanam, Krishnamurthy Viswanathan and Junan Zhang; December 27, 2004
• 2005

o Weak leader election for receive-omission process failures. CS2005-0812, Flavio Junqueira and Keith Marzullo; January 26, 2005
o A Systems Architecture for Ubiquitous Video. CS2005-0813, Neil J. McCurdy and William G. Griswold; February 4, 2005

o Harnessing Mobile Ubiquitous Video. CS2005-0814, Neil J. McCurdy and William G. Griswold; February 4, 2005

o Coping with Internet catastrophes. CS2005-0816, Flavio Junqueira, Ranjita Bhagwan, Alejandro Hevia, Keith Marzullo and Geoffrey M. Voelker; February 17, 2005

o The Virtual Grid Description Language: vgDL. CS2005-0817, Andrew Chien, Henri Casanova, Yang-suk Kee and Richard Huang; February 18, 2005

o NP-Completeness of the Divisible Load Scheduling Problem on Heterogeneous Star Platforms with Affine Costs. CS2005-0818, Arnaud Legrand, Yang Yang and Henri Casanova; March 10, 2005

o Accuracy Bounds For The Scaled Bitmap Data Structure. CS2005-0819, Sumeet Singh, Cristian Estan, George Varghese and Stefan Savage; March 22, 2005

o Place-Its: Location-Based Reminders on Mobile Phones. CS2005-0820, Timothy Sohn, Kevin Li, Gunny Lee, Ian Smith, James Scott and William Griswold; March 23, 2005

o Automatic Color Calibration for Large Camera Arrays. CS2005-0821, Neel Joshi, Bennett Wilburn, Vaibhav Vaish, Marc Levoy and Mark Horowitz; May 11, 2005

o The Power of Slicing in Internet Flow Measurement. CS2005-0822, Ramana Rao Estan Kompella Cristian; May 13, 2005

o Enhanced Design Flow and Optimizations for Multi-Project Wafers. CS2005-0823, Andrew Kahng, Ion Mandoiu, Xu Xu and Alex Zelikovsky; May 14, 2005
o The Overlay Network Content Distribution Problem. CS2005-0824, Chip Killian, Michael Vrable, Alex C. Snoeren, Amin Vahdat and Joseph Pasquale; May 18, 2005

o Combined Selection and Binding for Competitive Resource Environments. CS2005-0825, Yang-Suk Kee, Henri Casanova and Andrew A. Chien; May 18, 2005
o Evaluating Location Based Reminders. CS2005-0826, Kevin A. Li, Timothy Sohn and William G. Griswold; May 18, 2005

o Declarative Resource Naming for Macroprogramming Wireless Networks of Embedded Systems. CS2005-0827, Chalermek Intanagonwiwat, Rajesh Gupta and Amin Vahdat; May 30, 2005

o Weak Leader Election in the receive-omission failure model. CS2005-0829. Flavio Junqueira and Keith Marzullo; June 1, 2005 
o Efficient Cooperative Scheduling in 802.11 Wireless Networks. CS2005-0830, Ishwar Ramani, Ramana Rao Kompella, Sriam Rai
July 7, 2005 

o Coterie availability in sites (extended version). CS2005-0831, Flavio Junqueira and Keith Marzullo; July 27, 2005
o A Scalable Capstone Course for Academic Preparation. CS2005-0832, William G. Griswold; August 28, 2005
o Recognizing Cars. CS2005-0833, Louka Dlagnekov and Serge Belongie; September 28, 2005
o Maximum Instantaneous Power Estimation by Subgraph Coloring. CS2005-0834, Bao Liu; October 12, 2005

o Feedthrough Channel Effect on Wirelength Distribution in the Presence of Obstacles. CS2005-0835, Andrew Cheng Kahng Chung-Kuan Liu Bao Stroobandt Dirk; October 12, 2005

o Charge-Matching Tail Approximation in a Piece-wise Linear-and-Exponential Function. CS2005-0836, Bao Liu; October 12, 2005

o NP-Completeness and Approximation Scheme of Zero-Skew Clock Tree Problem. CS2005-0837, Bao Liu; October 13, 2005

o Software Profiling for Deterministic Replay Debugging of User Code. CS2005-0839, Satish Narayanasamy and Brad Calder; October 18, 2005

o Automatic Logging of Operating System Effects to Simplify Application-Level Architecture Simulation. CS2005-0840, Satish Narayanasamy, Cristiano Pereira, Harish Patil, Robert Cohn and Brad Calder; October 18, 2005
o Comparing Multinomial and K-Means Clustering for SimPoint. CS2005-0841, Greg Hamerly, Erez Perelman and Brad Calder; October 20, 2005
o Peer-to-Peer Error Recovery for Hybrid Satellite-Terrestrial Networks. CS2005-0842, Eric Weigle, Matti Hiltunen, Rick Schlichting, Vinay Vaishampayan and Andrew A. Chien; October 31, 2005

o Efficient Hardware Support for Deterministic Replay Debugging of Memory Races, Interrupts and Self Modifying Code. CS2005-0843, Satish Narayanasamy, Cristiano Pereira and Brad Calder; November 14, 2005

o Detecting Phases in Parallel Applications on Shared Memory Architectures. CS2005-0844, Erez Perelman, Marzia Polito, Jean-Yves Bouguet, John Sampson,



o XML Queries Using Nested Views. CS2005-0846, Emiran Curtmola, Alin Deutsch, Nicola Onose and Yannis Papakon

y, Ayse Coskun and Brad Calder; December 18, 2005

Michael Van Biesbrouck, Ganesh Venkatesh, Osvaldo Colavin and Brad Calder; December 18, 2005



NCSTRL

This server operates at UCSD Computer Science and Engineering. 
■bmaxtcrGlcAucstU'th, 






3/1/07 2:08 PM 






Intel/P16136 
US Application Serial No. 10/424,356 



EXHIBIT 10 
TO DECLARATION OF 
JAMES A FLIGHT 



Reports Listed by Author http://72.14.209.104/search?q=cache:5i3alxEiXT8J:www.cs.ucsd.e...



This is Google's cache of http://www.cs.ucsd.edu/Dienst/UI/2.0/ListAuthors/A-Z?authority=ncstrl.ucsd_cse as retrieved on Dec 29, 2006 13:26:28 GMT.
Google's cache is the snapshot that we took of the page as we crawled the web.
The page may have changed since that time.
This cached page may reference images which are no longer available.
Google is neither affiliated with the authors of this page nor responsible for its content.



Reports Listed by Author 



CS2006-0871, November 19, 2006
CS2003-0762, August 14, 2003



sumentation b' 
• Agrawal, A.

ucking Free-Riders: Distributed Accounting and Settlement in Peer-to-Peer Networks. CS2003-0751, June 24, 2003



o Distributed Application Management Using Plush. CS2006-0864, July 31, 2006
• Alvisi, L.

o Scalable Causal Message Logging for Wide-Area Networks. CS2000-0651, April 21, 2000

o Scalable Causal Message Logging for Wide-Area Environments. CS2001-0671, May 24, 2001

• Amer-Yahia, S.

o Flexible and Efficient XML Search with Complex Full-Text Predicates. CS2005-0845, December 12, 2005

• Andrew A.

o OptIPuter System Software Framework. CS2004-0786, April 28, 2004
• Atkinson, D.

o Implementation Techniques for Efficient Data-Flow Analysis of Large Programs. CS2001-0665, February 3, 2001

o A Comparative Study of Two Whole Program Slicers for C. CS2001-0668, April 12, 2001

• Austin, T.

o A Power Efficient Speculative Fetch Architectur. CS2000-0657, June 28, 2000

o Relieving Register File and Instruction Window Pressure. CS2001-0689, November 20, 2001

• Baboescu, F.

o Aggregated Bit Vector Search Algorithms for Packet Filter Lookups. CS2001-0673, June 3, 2001
o Proxy Caching with Hash Functions. CS2001-0674, June 3, 2001

o Fast and Scalable Conflict Detection for Packet Classifiers. CS2002-0718, August 7, 2002
o Packet Classification for Core Routers: Is there an alternatives to CAMs?. CS2002-0719, August 7, 2002
o HYPERCUTS: A Decision Tree Based Algorithm for Fast Packet Classification. CS2002-0730, December 12, 2002
o Packet Classification Using Multidimensional Cutting. CS2003-0736, February 7, 2003

o Increasing Object Visibility In Decentralized Unstructured Peer-To-Peer Networks Using Content Based Routing. CS2003-0740, March 28, 2003
o The case for ISP deployment of super-peers in P2P networks. CS2003-0742, April 15, 2003

o A New Direction in Tree Based Search Engine Architectures Using Balanced Single Port Memories. CS2004-0799, October 15, 2004

• Balachandran, A.

o Online Load Balancing and First-Hop Bandwidth Allocation in Public-Area Wireless Networks. CS2003-0748, June 10, 2003

• Bartol, T.

o The Virtual Instrument: Support for Grid-enabled Scientific Simulations. CS2002-0707, May 31, 2002

• Becker, D.

o MobiNet: A Scalable Emulation Infrastructure for Ad Hoc and Wireless Networks. CS2004-0792, June 14, 2004

• Belew, R. 

o packing iDpical hierarchies: A comparison of two algorithms for reconciling keyword structures. CS2001-0669, April 26, 2001

• Bellardo, J.

o Jigsaw: Solving the Puzzle of Enterprise 802.11 Analysis. CS2006-0849, February 21, 2006

• Bellare, M.

o Encode-then-encipher encryption: How to exploit nonces or redundancy in plaintexts for efficient cryptography. CS2000-0646, March 6, 2000

• Belongie, S.

o Segmentation by Example. CS2003-0762, August 14, 2003

o Approximation Methods for Thin Plate Spline Mappings and Principal Warps. CS2003-0764, September 4, 2003
o Structure from Periodic Motion. CS2003-0767, October 10, 2003

o A Feature-based Approach for Determining Dense Long Range Correspondences. CS2003-0768, October 20, 2003
o Recognizing Cars. CS2005-0833, September 28, 2005

• Benko, P.

o Jigsaw: Solving the Puzzle of Enterprise 802.11 Analysis. CS2006-0849, February 21, 2006

• Bent, L.

o A Comparative Study of Two Whole Program Slicers for C. CS2001-0668, April 12, 2001
o Reengineering Cocoon with AspectJ. CS2001-0682, September 4, 2001



^2002-0729, December 16, 2002 

• Berman, F.

o Adaptive Performance Prediction for Distributed Data-Intensive Applications. CS1999-0619, May 18, 1999

o Application Scheduling over Supercomputers: A Proposal. CS1999-0631, October 7, 1999

o Heuristics for Scheduling Parameter Sweep Applications in Grid Environments. CS1999-0632, October 14, 1999

o Combining Workstations and Supercomputers to Support Grid Applications: The Parallel Tomography Experience. CS2000-0642, January 7, 2000
o Application Scheduling on the Information Power Grid. CS2000-0644, January 11, 2000



o The Virtual Instrum 

° A Modular Schedal - 

o Resource Allocation for Steerable Parallel Parameter Searches:

y. CS2002-0720, August 18, 2002



3/1/07 2:08 PM 






o Replication Strategies for Highly Available Peer-to-peer Storage System
o The Phoenix Recovery System: Rebuilding from the ashes of an Interne
o Cone: Augmenting DHTs to Support



• Bhasker, E. 

o ActiveCampus - Sustaining Educational Communities through Mobile Technology. CS2002-0704, April 24, 2002

o ActiveCampus - Sustaining Educational Communities through Mobile Technology. CS2002-0714, July 8, 2002

o Using Mobile Technology to Create Opportunistic Interactions on a University Campus. CS2002-0724, October 16, 2002

o Employing User Feedback for Fast, Accurate, Low-Maintenance Geolocationing. CS2003-0765, September 8, 2003

• Bhatia, K.

o Scalable Causal Message Logging for Wide-Area Networks. CS2000-0651, April 21, 2000
o Scalable Causal Message Logging for Wide-Area Environment. CS2001-0671, May 24, 2001

o The Virtual Instrument: Support for Grid-enabled Scientific Simulations. CS2002-0707, May 31, 2002
o Resource Allocation for Steerable Parallel Parameter Searches: an Experimental Study. CS2002-0720, August 18, 2002
• Blair, P.



m Cover Multiple Access: A New Accroach h 



iu Cellular System Capacity. CS2000-0654, May 28, 2000



s Detecting Phases in Pnmllel Applici 



n Shared Memory A 



s. CS2005-0844, November 20, 2005



• Boyer, R.

o ActiveCampus - Sustaining Educational Communities through Mobile Technology. CS2002-0704, April 24, 2002
o ActiveCampus - Sustaining Educational Communities through Mobile Technology. CS2002-0714, July 8, 2002
o Using Mobile Technology to Create Opportunistic Interactions on a University Campus. CS2002-0724, October 16, 2002



» ActiveCmnnus - I 



3 Fulcrum - 



n Community-Oriented Ubiquhou 



CS2003-0750, June 24, 2003

t-Aware Publish / Subscribe. CS2004-0791, June 8, 2004



• Branson, K.

o Three Brown Mice: See How They Run. CS2003-0763, August 19, 2003

• Brooks, C.

o Characterizing and Evaluating Desktop Grids: An Empirical Study. CS2003-0769, October 22, 2003

• Brown, D.

o Bucking Free-Riders: Distributed Accounting and Settlement in Peer-to-Peer Networks. CS2003-0751, June 24, 2003

• Brown, S.

o ActiveCampus - Sustaining Educational Communities through Mobile Technology. CS2002-0704, April 24, 2002
- Sustaining Educational Communities through Mobile Technology. CS2002-0714, July 8, 2002



o Using Mobile Technology to C

o Employing User Feedback for Fast, Accurate, Low-Maintenance Geolocationing. CS2003-0765, September 8, 2003

• Burke, J.

• Burkhard, W.

o Uniform Hashing with Multiple Passbits. CS2000-0658, August 18, 2000

o Rotational Position Optimization (RPO) Disk Scheduling. CS2001-0679, July 16, 2001

o Improved Linux File System Hashing. CS2001-0680, July 16, 2001

• Buss, S.

o Exponential Separation of Res(k) and Res(k+1). CS2002-0697, January 11, 2002

• Calder, B.

o Reducing the Overhead of Compilation Delay. CS2000-0648, March 27, 2000
o Reducing DRAM Power Using Compiler Assisted Refreshing. CS2000-0649, April 21, 2000
o Dynamic Selection of Compression Formats to Reduce Transfer Delay. CS2000-0650, April 21, 2000



= Desipn Automation for Finite Si 



:t Efficient S 



e Predictors. C 



S, June 28. 2000 



e Fetch Architectur. CS2000-0657, June 28, 2000



o Using Annotations to Reduce Dynamic Optimization Time. CS2000-0663, November 16, 2000

o Basic Block Distribution Analysis to Find Periodic Behavior and Simulation Points in Applications. CS2001-0667, March 18, 2001

o Predicting Region Branches Using Predicate Update Branch Prediction. CS2001-0677, June 25, 2001

o Patchable Instruction ROM Architecture. CS2001-0678, June 25, 2001

o Efficient Design Space Exploration for Customized Processors. CS2001-0688, November 20, 2001
o Relieving Register File and Instruction Window Pressure. CS2001-0689, November 20, 2001

o Phase Tracking and Prediction. CS2002-0710, June 23, 2002

o Optimized Trace Binaries for Architectural Evaluation. CS2002-0711, June 23, 2002
o Pointer Cache Assisted Speculative Precomputation. CS2002-0712, June 23, 2002
o Using SimPoints in Diverse Simulation Environments. CS2002-0727, November 16, 2002
o Structures and Algorithms for Phase Classification. CS2003-0757, July 29, 2003

o A Co-Phase Matrix to Guide Simultaneous Multithreading Simulation. CS2003-0771, October 28, 2003



s for Phase Classification. CS2003-0772, October 28, 2003
o The Entropia Virtual Machine for Desktop Grids. CS2003-0773, October 28, 2003

o Code Pointer Protection From Buffer Overflow Through Targeted Hardware Encryption. CS2003-0774, December 1, 2003
o Building a Hierarchy of Variable Length Intervals to Capture Hierarchical Phase Behavior. CS2004-0781, March 13, 2004
o Efficient Sampling Startup for Uniprocessor and Simultaneous Multithreading Simulation. CS2004-0803, November 28, 2004
o Selecting Software Phase Markers with Code Structure Analysis. CS2004-0804, November 28, 2004
o Efficient Bounds Checking for C. CS2004-0805, November 28, 2004

o Software Profiling for Deterministic Replay Debugging of User Code. CS2005-0839, October 18, 2005

o Automatic Logging of Operating System Effects to Simplify Application-Level Architecture Simulation. CS2005-0840, October 18, 2005

o Comparing Multinomial and K-Means Clustering for SimPoint. CS2005-0841, October 20, 2005
o Efficient Hardware Support for Deterministic Replay Debugging of Memory Races, Interrupts and Self Modifyin
o Detecting Phases in Parallel Applications on Shared Memory Architectures. CS2005-0844, November 20, 2005

g Code. CS2005-0843, November 14, 2005



peculative Architecture Structures. C 




o Page-Based Transactional Memory to Provide Fast Virtual Transactions. CS2005-0848, December 18, 2005
• Carter, L.

o Selecting tile shape for minimal execution time. CS1999-0616, May 20, 1999
o Determining the idle time of a tiling. CS1999-0617, July 6, 1999

o Proof of Correctness for Sparse Tiling of Gauss-Seidel. CS2001-0690, December 4, 2001



p Compiler nnd Hardware Predicated D 



a PrtKifofCorrccm 
□ A Flow-based Task Si 



y Analysis and Scheduling. CS2002-0700, March 18, 2002

m of Gauss-Seidel. CS2003-0741, April 1, 2003

n Strategy for Distributed Systems. CS2003-0744, May 2, 2003



• Casanova, H. 

o Heuristics for Scheduling Parameter Sweep Applications in Grid Environments. CS1999-0632, October 14, 1999

o The Virtual Instrument: Support for Grid-enabled Scientific Simulations. CS2002-0707, May 31, 2002

o A Modular Scheduling Approach for Grid Application Development Environments. CS2002-0708, June 5, 2002



c Parallel Parameter Searches: an E 



al Study. CS2002-0720, August 18, 2002

d Algorithm for Scheduling Divisible Workload Applications: Analysis and Experimental Evaluation. CS2002-0721, September 26, 2002
o MPI Process Swapping: Architecture and Experimental Verification. CS2003-0735, January 29, 2003
o Extensions to the Multi-Installment Algorithm: Affine Costs and Output Data Transfers. CS2003-0754, July 16, 2003
o Benchmark Probes for Grid Assessment. CS2003-0760, August 1, 2003

o Characterizing and Evaluating Desktop Grids: An Empirical Study. CS2003-0769, October 22, 2003
o APST-DV: Divisible Load Scheduling and Deployment on the Grid. CS2004-0785, April 28, 2004

the Optimal Makespan for Jobs with Identical and Independent Tasks Scheduled on Volatile Hosts. CS2004-0796, July 12, 2004



> APST-DVrAPmeti 



tJbrS, 



M and Deploying Divi 



s. CS2004-0802, November 9, 2004



o Efficient Resource Description and High Quality Selection for Virtual Grids. CS2004-0809, December 17, 2004
o The Virtual Grid Description Language: vgDL. CS2005-0817, February 18, 2005

o NP-Completeness of the Divisible Load Scheduling Problem on Heterogeneous Star Platforms with Affine Costs. CS2005-0818, March 10, 2005



nam) Bind lnu , lor Competit ive Resource Environments, t 



s: A comparison of wo al 



s. CS2001-0669, April 26, 2001

e Web Transfer Times. CS2001-0683, September 11, 2001



• Charles A. 

o Minimum-Buffered Routing of Non-critical Nets. CS2001-0681, August 14, 2001

• Chen, H.

o The Y-architecture: Yet Another On-Chip Interconnect Solution. CS2002-0717, August 7, 2002

o A Multiple Level Network Approach for Clock Skew Minimization with Process Variations. CS2003-0756, July 28, 2003



n,Y. 



d Linux Pile System Has: 



&CS200I-0 



O.July I6.20DI 



• Cheng, C.

o Hurwitz Interconnect Delay Evaluation - HIDE: User's Manual. CS2000-0661, November 9, 2000
o Hurwitz Interconnect Delay Evaluation - HIDE: Programmer's Manual. CS2000-0662, November 9, 2000

r Network Reduction Via Generalized Y-$\Delta$ Transformation: Theory. CS2002-0706, May 22, 2002
o The Y-architecture: Yet Another On-Chip Interconnect Solution. CS2002-0717, August 7, 2002

o A Multiple Level Network Approach for Clock Skew Minimization with Process Variations. CS2003-0756, July 28, 2003

o GRYD: Generalized Reduced-Order Wye-Delta Transformation: Programmer's Manual for Reduction Engine and Applications. CS2003-0758, July 31, 2003



o GRYD: G

r Wye-Delta Transformation: User's Manual for R

n Engine and Applications. CS2003-0759, July 31, 2003



o Fast Transient Simulation of Lossy Transmission Lines. CS2006-0874. December 13, 2006 

• Cheng, Y. 

o Jigsaw: Solving the Puzzle of Enterprise 802.11 Analysis. CS2006-0849, February 21, 2006

• Chien, A.

o Characterizing and Evaluating Desktop Grids: An Empirical Study. CS2003-0769, October 22, 2003

o DGMonitor: a Performance Monitoring Tool for Sandbox-based Desktop Grid Platforms. CS2003-0770, October 24, 2003

o The Entropia Virtual Machine for Desktop Grids. CS2003-0773, October 28, 2003

o Understanding When Location-Hiding Using Overlay Networks is Feasible. CS2004-0788, May 9, 2004

o Evaluation of a High Performance Erasure Code Implementation. CS2004-0798, September 13, 2004

o Efficient Resource Description and High Quality Selection for Virtual Grids. CS2004-0809, December 17, 2004

o The Virtual Grid Description Language: vgDL. CS2005-0817, February 18, 2005

o Combined Selection and Binding for Competitive Resource Environments. CS2005-0825, May 18, 2005
o Peer-to-Peer Error Recovery for Hybrid Satellite-Terrestrial Networks. CS2005-0842, October 31, 2005



o RobuSTore: Robust Performance for D 



s. CS2006-0851, March 13, 2006

n for Data Gathering with Accuracy Requirement in Wireless Sensor Networks. CS2006-0854, April 5, 2006

is for Federated Systems. C

6-0865, August 28, 2006



• Chuang, W.

o Efficient Bounds Checking for C. CS2004-0805, November 28, 2004

o Page-Based Transactional Memory to Provide Fast Virtual Transactions. CS2005-0848, December 18, 2005
o Maintaining Safe Memory for Security, Debugging, and Multi-threading. CS2006-0873, December 9, 2006

• Chun, G.

o Benchmark Probes for Grid Assessment. CS2003-0760, August 1, 2003

• Chung, F.

o A Near-Optimal Algorithm for a Locality-Maximizing Placement Problem. CS2004-0777, January 16, 2004
• Cicotti, P.

o DGMonitor: a Performance Monitoring Tool for Sandbox-based Desktop Grid Platforms. CS2003-0770, October 24, 2003

• Cirne, W.

o Application Scheduling over Supercomputers: A Proposal. CS1999-0631, October 7, 1999

o Combining Workstations and Supercomputers to Support Grid Applications: The Parallel Tomography Experience. CS2000-0642, January 7, 2

• Ckrintz, C.

o Using Annotations to Reduce Dynamic Optimization Time. CS2000-0663, November 16, 2000

• Cohn, R.

o Automatic Logging of Operating System Effects to Simplify Application-Level Architecture Simulation. CS2005-0840, October 18, 2005

• Colavin, O.

o Page-Based Transactional Memory to Provide Fast Virtual Trai






• Coskun, A. 

■ ; Z. " " " ~ ' T 18, 2005 

• Cottrell, G.

o PCA = Gab

• Curtmola, E.

o Flexible and Efficient XML Search with Complex Full-Text Predicates. CS2005-0845, December 12, 2005
X Queries Using Nested Views. CS2005-0846, December 12, 2005

It Detection Using Speculative Architecture Structures. CS2005-0847,
xpression Recognition. CS1999-0629, October 26, 1999



• Dail, H.



A Modular Fm: 



nch for Grid Application Devel 



o Benchmark Probes for Grid Assessment. CS2003-0760, August 1, 2003

• Dailey, M.

o PCA = Gabor for Expression Recognition. CS1999-0629, October 26, 1999
• Demchak, B.

o Data Quality for Situational Awareness during Mass-Casualty Events. CS2006-0857, April 11, 2006

• Deutsch, A.

o Query Set Specification Language (QSSL). CS2003-0739, March 24, 2003

o CLIDE: Interactively Formulating Feasible Queries on Query Rewriting-Based Systems. CS2004-0807, December 12, 2004
o Flexible and Efficient XML Search with Complex Full-Text Predicates. CS2005-0845, December 12, 2005
o Rewriting Nested XML Queries Using Nested Views. CS2005-0846, December 12, 2005
m Data-Driven Web Services. CS2006-0853, March 27, 2006

o Data Exchange, Data Integration, and Chase. CS2006-0859, April 26, 2006
o Privacy in GLAV Information Integration. CS2006-0869, October 16, 2006

• Dlagnekov, L.

o Recognizing Cars. CS2005-0833, September 28, 2005

• Dolev, D.

o A Client-Server Oriented Algorithm for Virtually Synchronous Group Membership in WANs. CS1999-0623, July 7, 1999

• Donato, G.

o Approximation Methods for Thin Plate Spline Mappings and Principal Warps. CS2003-0764, September 4, 2003

• Dulong, C.

o Detecting Phases in Parallel Applications on Shared Memory Architectures. CS2005-0844, November 20, 2005



• Eeckhout, L. 
• Elkan, C.



Startup for Uniprocessor and Si 



g Simulation. CS2004-0803, November 28, 2004

and Making Decisions When Costs and Probabilities are Bot

CS2001-0664,

2001

o Sources of Success for Information Extraction Methods. CS2002-0696, January 7, 2002
o Alternatives to the k-means algorithm that find better clusterings. CS2002-0702, April 3, 2002
o Learning the k in k-means. CS2002-0716, July 30, 2002

• Ellis, R.

o JBIG Compression Algorithms for "Dummy Fill" VLSI Layout Data. CS2002-0709, June 14, 2002
• Ellisman, M.

o Combining Workstations and Supercomputers to Support Grid Applications: The Parallel Tomography Experience. CS2000-0642, January 7, 2000
o The Virtual Instrument: Support for Grid-enabled Scientific Simulations. CS2002-0707, May 31, 2002

• Estan, C.

o Scalable Measurement: Finding some Elephants in a Swarm of Ants. CS2001-0666, February 12, 2001
o New directions in traffic measurement and accounting. CS2002-0699, February 8, 2002
ing the number of active flows on a high speed link. CS2002-0705, May 21, 2002



5 Bitmap algorithms force 



d links. CS2003-0738, March 13, 2003

o Real-time Detection of Known and Unknown Worms. CS2003-0745, May 30, 2003
o Automatically Inferring Patterns of Resource Consumption in Network Traffic. CS2003-0746, June 2

o The Measurement Manifesto. CS2003-0747, June 4, 2003

o The EarlyBird System for Real-time Detection of Unknown Worms. CS2003-0761, August 4, 2003



□ An Adaptive S 1 



s Comparison between 



n for Real-time S 



rs anujj!. 



s of Internet Traffic. CS2003-0766, September 24, 2003

s for finding heavy hitters. CS2004-0784, April 27, 2004

c. CS2004-0793, June 15, 2004

o Accuracy Bounds For The Scaled Bitmap Data Structure. CS2005-0819, March 22, 2005
• Faerman, M.

o Adaptive Performance Prediction for Distributed Data-Intensive Applications. CS1999-0619, May 18, 1999
o The Virtual Instrument: Support for Grid-enabled Scientific Simulations. CS2002-0707, May 31, 2002

e Allocation for Steerable Parallel Parameter Searches: an Experimental Study. CS2002-0720, August 18, 2002



• Ferrante, J.

o Selecting tile s
o Determining th

o Predicting Region Branches Using Predicate Update Branch Prediction. CS2001-0677, June 25, 2001
o Proof of Correctness for Sparse Tiling of Gauss-Seidel. CS2001-0690, December 4, 2001
o Proof of Correctness for Sparse Tiling of Gauss-Seidel. CS2003-0741, April 1, 2003

o A Flow-based Task Scheduling Strategy for D

d Systems. CS2003-0744, May 2, 2003



• Fisk, M. 

o Fast Content-Based Packet Handling for Intrusion Detection. CS2001-0670, May 7, 2001
o Counting the number of active flows on a high speed link. CS2002-0705, May 21, 2002

o Bitmap algorithms for counting active flows on high s

d links. CS2003-0738, March 13, 2003

o Active learning for visual object detection, C

is by indexing patterns. CS2006-0858, April 26, 2006

6-0871, November 19, 2006



• Frey, J.

o Combining Workstations and Supercomputers to Support Grid Applications: The Parallel Tomography Experience. CS2000-0642, January 7, 2000

• Geoffrey, M.

o Coping with Dependent Process Failures. CS2002-0723, October 7, 2002

• Ghosh, R.

o Modifying Shortest Path Routing Protocols to Create Symmetrical Routes. CS2001-0685, September 11, 2001






o The Virtual Instrument: Support for Grid-enabled Scientific Simulations. CS2002-0707, May 31, 2002

• Goguen, J.

o Circular Coinduction. CS2000-0647, March 14, 2000

o Critical Points for Interactive Schema Matching. CS2004-0779, January 30, 2004

• Govindan, R.

o The Impact of Address Allocation and Routing on the Structure and Implementation of Routing Tables. CS2003-0749, June 19, 2003

• Graham, R.

o A Near-Optimal Algorithm for a Locality-Maximizing Placement Problem. CS2004-0777, January 16, 2004

• Griswold, W.

o AspectBrowser: Tool Support for Managing Dispersed Aspects. CS1999-0640, January 3, 2000

o Teaching Software Engineering in a Compiler Project Course. CS2000-0659, September 12, 2000

o Exploiting the Map Metaphor in a Tool for Software Evolution. CS2000-0660, September 20, 2000

o Implementation Techniques for Efficient Data-Flow Analysis of Large Programs. CS2001-0665, February 3, 2001

o A Comparative Study of Two Whole Program Slicers for C. CS2001-0668, April 12, 2001

o The ActiveClass Project: Experiments in Encouraging Classroom Participation. CS2002-0703, April 24, 2002

o ActiveCampus - Sustaining Educational Communities through Mobile Technology. CS2002-0704, April 24, 2002

o ActiveCampus - Sustaining Educational Communities through Mobile Technology. CS2002-0714, July 8, 2002

o The ActiveClass Project: Experiments in Encouraging Classroom Participation. CS2002-0715, July 8, 2002

o Using Mobile Technology to Create Opportunistic Interactions on a University Campus. CS2002-0724, October 16, 2002

o ActiveCampus - Experiments in Community-Oriented Ubiquitous Computing. CS2003-0750, June 24, 2003

o Employing User Feedback for Fast, Accurate, Low-Maintenance Geolocationing. CS2003-0765, September 8, 2003

o Tele-Reality for the Rest of Us. CS2004-0778, January 16, 2004

o Fulcrum - An Open-Implementation Approach to Context-Aware Publish / Subscribe. CS2004-0791, June 8, 2004

o A Systems Architecture for Ubiquitous Video. CS2005-0813, February 4, 2005
o Harnessing Mobile Ubiquitous Video. CS2005-0814, February 4, 2005

o Place-Its: Location-Based Reminders on Mobile Phones. CS2005-0820, March 23, 2005

o Evaluating Location Based Reminders. CS2005-0826, May 18, 2005

o A Scalable Capstone Course for Academic Preparation. CS2005-0832, August 28, 2005

o A Robust Abstraction for First-Person Video Streaming: Techniques, Applications, and Experiments. CS2006-0855, April 7, 2006
o Data Quality for Situational Awareness during Mass-Casualty Events. CS2006-0857, April 11, 2006

• Grove, D. 

o Reducing the Overhead of Compilation Delay. CS2000-0648, March 27, 2000

• Gupta, R.

o Using Program Phases as Meta-Data for Runtime Energy Optimization. CS2004-0797, July 14, 2004

o Declarative Resource Naming for Macroprogramming Wireless Networks of Embedded Systems. CS2004-0800, November 2, 2004
o Declarative Resource Naming for Macroprogramming Wireless Networks of Embedded Systems. CS2005-0827, May 30, 2005
o Online Learning Algorithms for Dynamic Power Management. CS2006-0856, April 8, 2006

• Hamerly, G.

o Alternatives to the k-means algorithm that find better clusterings. CS2002-0702, April 3, 2002
o Learning the k in k-means. CS2002-0716, July 30, 2002

o Building a Hierarchy of Variable Length Intervals to Capture Hierarchical Phase Behavior. CS2004-0781, March 13, 2004
o Comparing Multinomial and K-Means Clustering for SimPoint. CS2005-0841, October 20, 2005

• Hevia, A.

o Coping with Internet catastrophes. CS2005-0816, February 17, 2005

• Hiltunen, M.

o Peer-to-Peer Error Recovery for Hybrid Satellite-Terrestrial Networks. CS2005-0842, October 31, 2005
o Customizable Service State Durability for Service Oriented Architectures. CS2006-0861, July 13, 2006

• Hogstedt, K.

o Selecting tile shape for minimal execution time. CS1999-0616, May 20, 1999
o Determining the idle time of a tiling. CS1999-0617, July 6, 1999

• Hohlfeld, M.

o Security in the Sanctuary System. CS2002-0731, December 20, 2002

• Horowitz, M.

o Automatic Color Calibration for Large Camera Arrays. CS2005-0821, May 11, 2005

• Hu, T.

o Minimax Programs and Bitonic Column Matrices. CS1999-0624, June 17, 1999
o Min Cuts Without Path Packing. CS1999-0625, June 17, 1999
o One Dimensional Knapsack. CS2004-0775, January 14, 2004

• Hu, Y.

o Optimized Trace Binaries for Architectural Evaluation. CS2002-0711, June 23, 2002

• Huang, R.

o Efficient Resource Description and High Quality Selection for Virtual Grids. CS2004-0809, December 17, 2004

o The Virtual Grid Description Language: vgDL. CS2005-0817, February 18, 2005

o Failure-Resilient Expectations for Federated Systems. CS2006-0865, August 28, 2006

• Hung, E.

o Agent Usage Patterns: Bridging the Gap Between Agent-Based Applications and Middleware. CS1999-0638, November 19, 1999
o Agent Behavior Patterns in a Wireless Internet Environment. CS2001-0693, December 17, 2001

• Ie, E.

o BioSpike: Efficient search for homologous proteins by indexing patterns. CS2006-0858, April 26, 2006

• Impagliazzo, R.

o Bounded-Depth Frege with Counting Principles Polynomially Simulates Nullstellensatz Refutations. CS2001-0686, November 14, 2001
o Exponential Separation of Res(k) and Res(k+1). CS2002-0697, January 11, 2002

• Intanagonwiwat, C.

o Declarative Resource Naming for Macroprogramming Wireless Networks of Embedded Systems. CS2004-0800, November 2, 2004
o Declarative Resource Naming for Macroprogramming Wireless Networks of Embedded Systems. CS2005-0827, May 30, 2005

• Jacobsen, K. 

o Group Membership and Wide-Area Master-Worker Computations. CS2002-0725, November 6, 2002

• Jay, G.

o ActiveCampus - Sustaining Educational Communities through Mobile Technology. CS2002-0704, April 24, 2002

o ActiveCampus - Sustaining Educational Communities through Mobile Technology. CS2002-0714, July 8, 2002

o Using Mobile Technology to Create Opportunistic Interactions on a University Campus. CS2002-0724, October 16, 2002



o Single Image Appearance Measurement. CS2006-0868, October 13, 2006
• Joshi, N.

o Automatic Color Calibration for Large Camera Arrays. CS2005-0821, May 11, 2005
o Single Image Appearance Measurement. CS2006-0868, October 13, 2006



o Synchronous Consensus for Dependent Process Failures. CS2002-0722, October 3, 2002

o Coping with Dependent Process Failures. CS2002-0723, October 7, 2002

o The Phoenix Recovery System: Rebuilding from the ashes of an Internet catastrophe. CS2003-0732, January 13, 2003

o Connectivity in the South American Internet. CS2003-0733, January 13, 2003

o Lower Bound on the Number of Rounds for Consensus with Dependent Process Failures. CS2003-0734, January 13, 2003
o Consensus for Dependent Process Failures. CS2003-0737, February 18, 2003



o Coping with Internet catastrophes. CS2005-0816, February 17, 2005

o Weak Leader Election in the receive-omission failure model. CS2005-0829, June 1, 2005

o Coterie availability in sites (extended version). CS2005-0831, July 27, 2005



- "Dummy Fillt* VLSI Lovoul Dam. CS2Q02-0709, June 14,2002 



o A Placement Methodology for Global Interconnect Reduction and Its Impact on Performance. CS2004-0801, October 31, 2004
o Enhanced Design Flow and Optimizations for Multi-Project Wafers. CS2005-0823, May 14, 2005

o Feedthrough Channel Effect on Wirelength Distribution in the Presence of Obstacles. CS2005-0835, October 12, 2005

• Karanicolas, J. 

o Characterizing and Evaluating Desktop Grids: An Empirical Study. CS2003-0769, October 22, 2003
• Kato, Y.

o Aspect Browser: Tool Support for Managing Dispersed Aspects. CS1999-0640, January 3, 2000

• Kauchak, D.

o Sources of Success for Information Extraction Methods. CS2002-0696, January 7, 2002
• Kee, Y.

o Efficient Resource Description and High Quality Selection for Virtual Grids. CS2004-0809, December 17, 2004
o The Virtual Grid Description Language: vgDL. CS2005-0817, February 18, 2005

o Combined Selection and Binding for Competitive Resource Environments. CS2005-0825, May 18, 2005

• Keidar, I.

o A Client-Server Oriented Algorithm for Virtually Synchronous Group Membership in WANs. CS1999-0623, July 7, 1999
o Optimistic Virtual Synchrony. CS1999-0634, November 9, 1999



o Combining Workstations and Supercomputers to Support Grid Applications: The Parallel Tomography Experience. CS2000-0642, January 7, 2000

• Keys, K.

o An Adaptive System for Real-time Summaries of Internet Traffic. CS2003-0766, September 24, 2003

• Killian, C.

o The Overlay Network Content Distribution Problem. CS2005-0824, May 18, 2005
• Kompella, R.

o The Power of Slicing in Internet Flow Measurement. CS2005-0822, May 13, 2005

o Efficient Cooperative Scheduling in 802.11 Wireless Networks. CS2005-0830, July 7, 2005

• Kondo, D. 

o Characterizing and Evaluating Desktop Grids: An Empirical Study. CS2003-0769, October 22, 2003

o Computing the Optimal Makespan for Jobs with Identical and Independent Tasks Scheduled on Volatile Hosts. CS2004-0796, July 12, 2004

• Kreibich, C. 

o Automatic Protocol Inference: Unexpected Means of Identifying Protocols. CS2006-0850, February 21, 2006

• Krintz, C.

o Reducing the Overhead of Compilation Delay. CS2000-0648, March 27, 2000

o Dynamic Selection of Compression Formats to Reduce Transfer Delay. CS2000-0650, April 21, 2000
o Reducing Load Delay to Improve Performance of Internet-Computing Programs. CS2001-0672, May 25, 2001
• Landa, L.

o One Dimensional Knapsack. CS2004-0775, January 14, 2004

o Optimizing the Knapsack Problem. CS2004-0783, April 2, 2004

o Sage Algorithms for Knapsack Problem. CS2004-0794, June 18, 2004

• Lau, J.

o Structures and Algorithms for Phase Classification. CS2003-0757, July 29, 2003
o Structures for Phase Classification. CS2003-0772, October 28, 2003

o Building a Hierarchy of Variable Length Intervals to Capture Hierarchical Phase Behavior. CS2004-0781, March 13, 2004
o Selecting Software Phase Markers with Code Structure Analysis. CS2004-0804, November 28, 2004

• Lee, G.

o Place-Its: Location-Based Reminders on Mobile Phones. CS2005-0820, March 23, 2005
• Legrand, A.

o Heuristics for Scheduling Parameter Sweep Applications in Grid Environments. CS1999-0632, October 14, 1999

o NP-Completeness of the Divisible Load Scheduling Problem on Heterogeneous Star Platforms with Affine Costs. CS2005-0818, March 10, 2005

o A Robust Abstraction for First-Person Video Streaming: Techniques, Applications, and Experiments. CS2006-0855, April 7, 2006
o Data Quality for Situational Awareness during Mass-Casualty Events. CS2006-0857, April 11, 2006

• Levchenko, K.

o Automatic Protocol Inference: Unexpected Means of Identifying Protocols. CS2006-0850, February 21, 2006

o Incremental Sparse Binary Vector Similarity Search in High-Dimensional Space. CS2006-0866, September 26, 2006

• Levoy, M.

o Automatic Color Calibration for Large Camera Arrays. CS2005-0821, May 11, 2005
• Li, K.

o Place-Its: Location-Based Reminders on Mobile Phones. CS2005-0820, March 23, 2005
o Evaluating Location Based Reminders. CS2005-0826, May 18, 2005



• Lin, K. 

o Critical Points for Interactive Schema Matching. CS2004-0779, January 30, 2004
• Lin, M.

o On the Resilience of Broadcasting Strategies in a Failure-Propagating Environment. CS1999-0610, July 6, 1999



o Directional Gossip: Gossip in a Wide Area Network. CS1999-0622, June 16, 1999
o Gossip versus Deterministic Flooding: Low Message Overhead and High Reliability for Broadcasting on Small Networks. CS1999-0637, November 18, 1999



o Maximum Instantaneous Power Estimation by Subgraph Coloring. CS2005-0834, October 12, 2005
o Charge-Matching Tail Approximation in a Piece-wise Linear-and-Exponential Function. CS2005-0836, October 12, 2005
o NP-Completeness and Approximation Schema of Zero-Skew Clock Tree Problem. CS2005-0837, October 13, 2005

• Logothetis, D.

o Efficient Resource Description and High Quality Selection for Virtual Grids. CS2004-0809, December 17, 2004
o Failure-Resilient Expectations for Federated Systems. CS2006-0865, August 28, 2006

• Ma, J.

o Automatic Protocol Inference: Unexpected Means of Identifying Protocols. CS2006-0850, February 21, 2006
o Incremental Sparse Binary Vector Similarity Search in High-Dimensional Space. CS2006-0866, September 26, 2006
• Ma, Z.

o Dynamic Power Aware Packet Processing with CMP. CS2006-0852, March 21, 2006
o Online Learning Algorithms for Dynamic Power Management. CS2006-0856, April 8, 2006

• Mahadevan, P.

o Cone: A Distributed Heap Appro
o MobiNet: A Scalable Emulation

• Mandoiu, I.

o Enhanced Design Flow and Optimizations for Multi-Project Wafers. CS2005-0823, May 14, 2005

• Markov, I.

o A Placement Methodology for Global Interconnect Reduction and Its Impact on Performance. CS2004-0801, October 31, 2004

• Martini, P.

o Uniform Hashing with Multiple Passbits. CS2000-0658, August 18, 2000



o A Client-Server Oriented Algorithm for Virtually Synchronous Group Membership in WANs. CS1999-0623, July 7, 1999
o Optimistic Virtual Synchrony. CS1999-0634, November 9, 1999

o Gossip versus Deterministic Flooding: Low Message Overhead and High Reliability for Broadcasting on Small Networks. CS1999-0637, November 18, 1999
o Scalable Causal Message Logging for Wide-Area Networks. CS2000-0651, April 21, 2000

o Scalable Causal Message Logging for Wide-Area Environments. CS2001-0671, May 24, 2001
o Synchronous Consensus for Dependent Process Failures. CS2002-0722, October 3, 2002
o Coping with Dependent Process Failures. CS2002-0723, October 7, 2002

o Group Membership and Wide-Area Master-Worker Computations. CS2002-0725, November 6, 2002

o The Phoenix Recovery System: Rebuilding from the ashes of an Internet catastrophe. CS2003-0732, January 13, 2003

o Lower Bound on the Number of Rounds for Consensus with Dependent Process Failures. CS2003-0734, January 13, 2003

o Consensus for Dependent Process Failures. CS2003-0737, February 18, 2003

o On the Generalization ofn>l«*t. CS2003-0743, April 21, 2003 

o Detecting Malicious Routers. CS2004-0789, May 24, 2004

o Weak leader election for receive-omission process failures. CS2005-0812, January 26, 2005
o Coping with Internet catastrophes. CS2005-0816, February 17, 2005
o Weak Leader Election in the receive-omission failure model. CS2005-0829, June 1, 2005
o Coterie availability in sites (extended version). CS2005-0831, July 27, 2005

o Customizable Service State Durability for Service Oriented Architectures. CS2006-0861, July 13, 2006
• Masini, S.

o Gossip versus Deterministic Flooding: Low Message Overhead and High Reliability for Broadcasting on Small Networks. CS1999-0637, November 18, 1999

• McCurdy, N.

o Tele-Reality for the Rest of Us. CS2004-0778, January 16, 2004

o A Systems Architecture for Ubiquitous Video. CS2005-0813, February 4, 2005
o Harnessing Mobile Ubiquitous Video. CS2005-0814, February 4, 2005

o A Robust Abstraction for First-Person Video Streaming: Techniques, Applications, and Experiments. CS2006-0855, April 7, 2006

• McNett, M.

o Access and Mobility of Wireless PDA Users. CS2004-0780, February 9, 2004
• Micciancio, D.

o Guessing Two Secrets with Small Queries. CS2001-0687, November 14, 2001

• Miller, M.

o The Virtual Instrument: Support for Grid-enabled Scientific Simulations. CS2002-0707, May 31, 2002

• Mizrak, A.

o Detecting Malicious Routers. CS2004-0789, May 24, 2004

• Moore, D.

o An Adaptive System for Real-time Summaries of Internet Traffic. CS2003-0766, September 24, 2003
o Network Telescopes: Technical Report. CS2004-0795, July 7, 2004

• Mori, M. 

o A Multiple Level Network Approach for Clock Skew Minimisation with Process Variations. CS2003-0756. July 28, 2003 



o Multi-Language Support in a Program Analysis and Visualization Tool. CS2000-0655, June 20, 2000

• Mysore, M.

o FTP-M: An FTP-like Multicast File Transfer Application. CS2001-0684, September 11, 2001

• Nam, Y.

o Critical Points for Interactive Schema Matching. CS2004-0779, January 30, 2004

• Nandy, S.

o A Flow-based Task Scheduling Strategy for Distributed Systems. CS2003-0744, May 2, 2003

o Online Load Balancing and First-Hop Bandwidth Allocation in Public-Area Wireless Networks. CS2003-0748, June 10, 2003
• Narayan, R.

o The Impact of Address Allocation and Routing on the Structure and Implementation of Routing Tables. CS2003-0749, June 19, 2003

• Narayanasamy, S.



L. CS2Q04-Q805, November 28. 2004 

o Software Profiling for Deterministic Replay Debugging of User Code. CS2005-0839, October 18, 2005

o Automatic Logging of Operating System Effects to Simplify Application-Level Architecture Simulation. CS2005-0840, October 18, 2005
o Efficient Hardware Support for Deterministic Replay Debugging of Memory Races, Interrupts and Self-Modifying Code. CS2005-0843, November 14, 2005
Speculative Architecture Structures. CS2005-0847, December 16, 2005

o Page-Based Transactional Memory to Provide Fast Virtual Transactions. CS2005-0848, December 18, 2005

• Nash, A.

o Data Exchange, Data Integration, and Chase. CS2006-0859, April 26, 2006
o Privacy in GLAV Information Integration. CS2006-0869, October 16, 2006
• Obertelli, G.

o The Virtual Instrument: Support for Grid-enabled Scientific Simulations. CS2002-0707, May 31, 2002

• Ojha, A.

o Security in the Sanctuary System. CS2002-0731, December 20, 2002

o Bucking Free-Riders: Distributed Accounting and Settlement in Peer-to-Peer Networks. CS2003-0751, June 24, 2003

• Onose, N.

o Rewriting Nested XML Queries Using Nested Views. CS2005-0846, December 12, 2005
• Orlitsky, A.

o Semi-parametric
o Supervised di



k CS2O04-OS 1 1, December 27. 2004 

n far Customized Processors. CS2001-0688, T> 



Rotational Position Optimization (RPO) Disk Scheduling. CS2001-0679, July 16, 2001



o Improved Linux Tile System Hashing. C 



o Query Set Specification 



;d XMLQucrua U 



12,2 



12,2004 

I?, 1999 



Bridging the Gap Between Agent-Based Applications and Middleware. CS1999-0638

o Limited Mobile Agents: A Practical Approach. CS2000-0641, December 29, 1999
o Dynamic Web Stream Customizers. CS2001-0691, December 14, 2001

o A Web Middleware Architecture for Dynamic Customization of Web Content for Non-Traditional Clients. CS2001-0692, December 14, 2001
o Agent Behavior Patterns in a Wireless Internet Environment. CS2001-0693, December 17, 2001
o Using Network Flow Buffering to Improve Performance of Video over HTTP. CS2004-0776, January 14, 2004
o The Overlay Network Content Distribution Problem. CS2005-0824, May 18, 2005
• Patil, H.

o Automatic Logging of Operating System Effects to Simplify Application-Level Architecture Simulation. CS2005-0840, October 18, 2005
tv Transmission Lines. CS2006-0874, December 13, 2006



• Pereira, C. 

o Using Program Phases as Meta-Data for Runtime Energy Optimization. CS2004-0797, July 14, 2004

o Automatic Logging of Operating System Effects to Simplify Application-Level Architecture Simulation. CS2005-0840, October 18, 2005
o Efficient Hardware Support for Deterministic Replay Debugging of Memory Races, Interrupts and Self-Modifying Code. CS2005-0843, November 14, 2005



o Using SimPoints in Diverse Simulation Environments. CS2002-0727, November 16, 2002
o Building a Hierarchy of Variable Length Intervals to Capture Hierarchical Phase Behavior. CS2004-0781, March 13, 2004
o Selecting Software Phase Markers with Code Structure Analysis. CS2004-0804, November 28, 2004



m Phases in Parallel A 



• Petropoulos, M. 

o Query Set Specification Language (QSSL). CS2003-0739, March 24, 2003

o GLIDE: Interactively Formulating Feasible Queries on Query Rewriting-Based Systems. CS2004-0807, December 12, 2004
• Pokam, G.

o Page-Based Transactional Memory to Provide Fast Virtual Transactions. CS2005-0848, December 18, 2005

• Polilo, M. 

o Detecting Phases in Parallel Applications on Shared Memory Architectures. CS2005-0844, November 20, 2005

• Polyzos, G. 

o Plane Cover Multiple Access: A New Approach to Maximizing Cellular System Capacity. CS2000-0654, May 28, 2000

• Pomerantz, S.

o The Virtual Instrument: Support for Grid-enabled Scientific Simulations. CS2002-0707, May 31, 2002

• Qin, Z.

o -lunvil/. Interconnect Delay Evaluation - HIDE: User's Manual. CS200Q-O66 1, November 9, 2000 

- - - ■ - . HIDE: Programmer's Manual. CS2000-0662. November 9. 2000 



° GRVD:C 



ir Network R 



n Via G 



d Y-S\DeItnS Tmnsformalipn: Theory. CS2002-Q706, May 22, 2002 



:d Reduced-Order Wyc-Dclla Transformation: Programmer's Manual for Redaction ETna.inc and Applici 



s, CS2003-075B. July 3 1 , 2003 



o GRYD: Generalized Reduced-Order Wye-Delta Transformation: User's Manual for Reduction Engine and Applications. CS2003-0759, July 31, 2003
• Remmel, J.

o Data Exchange, Data Integration, and Chase. CS2006-0859, April 26, 2006
0,052003-0763, August 19, 2003 
U. CS2006-OB70, October 24, 2006 



o Distributed Rate Limiting. CS2006-0870, October 24, 2006



o Efficient Cooperative Scheduling in 802.11 Wireless Networks. CS2005-0830, July 7, 2005

• Rangan, V.

o Online Load Balancing and First-Hop Bandwidth Allocation in Public-Area Wireless Networks. CS2003-0748, June 10, 2003

• Ratto, M.

o The ActiveClass Project: Experiments in Encouraging Classroom Participation. CS2002-0703, April 24, 2002

o ActiveCampus - Experiments in Community-Oriented Ubiquitous Computing. CS2003-0750, June 24, 2003

• Reda, S.

o A Placement Methodology for Global Interconnect Reduction and Its Impact on Performance. CS2004-0801, October 31, 2004

• Reinman, G.

o A Power Efficient Speculative Fetch Architecture. CS2000-0657, June 28, 2000
o Hardware Optimizations Enabled by a Decoupled Fetch Architecture. CS2001-0676, June 22, 2001



e. CS2001-0689, November 20, 2001

CS1999-0610, July 6, 1999

■ Rodriguez, A. 

o MobiNcl: A Scalable Emulation Infrastructure for Ad Hoc and Wireless Networks. CS2004-0792. June 14. 2004 

• Rogaway, P. 

o Encode-then-encipher encryption: How to exploit nonces or redundancy in plaintexts for efficient cryptography. CS2000-0646, March 6, 2000

• Rosu, G.

o Proofs on Safety for Untrusted Code. CS1999-0633, October 27, 1999

o Circular Coinduction. CS2000-0647, March 14, 2000

o Abstract Semantics for Module Composition. CS2000-0653, May fl, 2000

• Sair, S.

o A Decoupled Predictor-Directed Stream Prefetching Architecture. CS2001-0694, December 17, 2001

o Phase Tracking and Prediction. CS2002-0710, June 23, 2002

o Optimized Trace Binaries for Architectural Evaluation. CS2002-0711, June 23, 2002

o Pointer Cache Assisted Speculative Precomputation. CS2002-0712, June 23, 2002

o Predictor-Directed Data Prefetching for Pointer-based Applications. CS2003-0753, June 25, 2003

• Sajama, S.

o Semi-parametric exponential family PCA: reducing dimensions via non-parametric latent distribution estimation. CS2004-0790, June 2, 2004
o Supervised dimensionality reduction using mixture models. CS2004-0810, December 27, 2004

o Detecting Phases in Parallel Applications on Shared Memory Architectures. CS2005-0844, November 20, 2005
o Page-Based Transactional Memory to Provide Fast Virtual Transactions. CS2005-0848, December 18, 2005

pattern entropy. CS2004-0811, December 27, 2004

o Reducing the Overhead of Compilation Delay. CS2000-0648, March 27, 2000
• Savage, S.

o Replication Strategies for Highly Available Peer-to-Peer Storage Systems. CS2002-0726, November 6, 2002



o The Phoenix Recovery System: Rebuilding from the ashes of an Internet catastrophe. CS2003-0732, January 13, 2003

o Real-time Detection of Known and Unknown Worms. CS2003-0745, May 30, 2003

o Automatically Inferring Patterns of Resource Consumption in Network Traffic. CS2003-0746, June 2, 2003

o Bucking Free-Riders: Distributed Accounting and Settlement in Peer-to-Peer Networks. CS2003-0751, June 24, 2003

o The EarlyBird System for Real-time Detection of Unknown Worms. CS2003-0761, August 4, 2003

o A Near-Optimal Algorithm for a Locality-Maximizing Placement Problem. CS2004-0777, January 16, 2004



o Detecting Malicious Routers. CS2004-0789, May 24, 2004
o Network Telescopes: Technical Report. CS2004-0795, July 7, 2004
o Accuracy Bounds For The Scaled Bitmap Data Structure. CS2005-0819, March 22, 2005
o Jigsaw: Solving the Puzzle of Enterprise 802.11 Analysis. CS2006-0849, February 21, 2006
o Automatic Protocol Inference: Unexpected Means of Identifying Protocols. CS2006-0850, February 21, 2006

• Schlichting, R.

o Peer-to-Peer Error Recovery for Hybrid Satellite-Terrestrial Networks. CS2005-0842, October 31, 2005
o Customizable Service State Durability for Service Oriented Architectures. CS2006-0861, July 13, 2006

o Structures and Algorithms for Phase Classification. CS2003-0757, July 29, 2003
o Structures for Phase Classification. CS2003-0772, October 28, 2003

• Scott, J.

o Place-Its: Location-Based Reminders on Mobile Phones. CS2005-0820, March 23, 2005

• Segerlind, N.

o Proofs on Safety for Untrusted Code. CS1999-0633, October 27, 1999

o Bounded-Depth Frege with Counting Principles Polynomially Simulates Nullstellensatz Refutations. CS2001-0686, November 14, 2001
o Guessing Two Secrets with Small Queries. CS2001-0687, November 14, 2001
o Exponential Separation of Res(k) and Res(k+1). CS2002-0697, January 11, 2002
• Sejnowski, T.

o The Virtual Instrument: Support for Grid-enabled Scientific Simulations. CS2002-0707, May 31, 2002



is in Conimunitv-0 



• Shannon, C. 

o Network Telescopes: Technical Report. CS2004-0795. July 7. 2004 

• Shapiro, R. 

o AclivcCnmpus-Sllsilijni 
o Active 



d Ubiquitous CompiUlnR. CS2003-O750, June 24. 2003 



m Mobile Technnlouv to Create Opportunilistie In 



u University. Campus. CS2002-O724, October 1 6. 2002 



o ActiveCampus - Experiments in Community-Oriented Ubiquitous Computing. CS2003-0750, June 24, 2003
• Sherwood, T.

o Reducing DRAM Power Using Compiler Assisted Refreshing. CS2000-0649, April 21, 2000
o Design Automation for Finite State Machine Predictors. CS2000-0656, June 28, 2000

o Basic Block Distribution Analysis to Find Periodic Behavior and Simulation Points in Applications. CS2001-0667, March 18, 2001



o Phase Tracking and Prediction. CS2002-0710, June 23, 2002

o Optimized Trace Binaries for Architectural Evaluation. CS2002-0711, June 23, 2002

o Using SimPoints in Diverse Simulation Environments. CS2002-0727, November 16, 2002

o Application-Tuned Processor Architectures. CS2003-0752, June 25, 2003

o A Co-Phase Matrix to Guide Simultaneous Multithreading Simulation. CS2003-0771, October 28, 2003

o Building a Hierarchy of Variable Length Intervals to Capture Hierarchical Phase Behavior. CS2004-0781, March 13, 2004

• Shing, M.

o Min Cuts Without Path Packing. CS1999-0625, June 17, 1999
o One Dimensional Knapsack. CS2004-0775, January 14, 2004

• Sievert, O.

o MPI Process Swapping Architecture and Experimental Verification. CS2003-0735, January 29, 2003



3 Pretl 



u Hciiion Branches Ust'nn Predicnte Updnli 



a. CS2001-0677, June 25, 2001 



o Tnrninc Predicate Information lo Advantage lo Improve Compiler Scheduling and Branch Prediction. CS2001-06B5. December27. 2001 
• Singh, S. 

o Packet Classification for Core Routers: Is there an alternative to CAMs?. CS2002-0719, August 7, 2002
o HYPERCUTS: A Decision Tree Based Algorithm for Fast Packet Classification. CS2002-0730, December 12, 2002
o Packet Classification Using Multidimensional Cutting. CS2003-0736, February 7, 2003

o Increasing Object Visibility in Decentralized Unstructured Peer-To-Peer Networks Using Content Based Routing. CS2003-0740, March 28, 2003



o The case for ISP deployment of super-peers in P2P networks. CS2003-0742, April 15, 2003

o Accuracy Bounds For The Scaled Bitmap Data Structure. CS2005-0819, March 22, 2005



o Combining Workstations and Supercomputers to Support Grid Applications: The Parallel Tomography Experience. CS2000-0642, January 7, 2000



• Smith, I.

o Place-Its: Location-Based Reminders on Mobile Phones. CS2005-0820, March 23, 2005
• Snavely, A.

o Benchmark Probes for Grid Assessment. CS2003-0760, August 1, 2003

• Snoeren, A.

o The case for ISP deployment of super-peers in P2P networks. CS2003-0742, April 15, 2003

o The Overlay Network Content Distribution Problem. CS2005-0824, May 18, 2005

o Jigsaw: Solving the Puzzle of Enterprise 802.11 Analysis. CS2006-0849, February 21, 2006
o Distributed Application Management Using Plush. CS2006-0864, July 31, 2006
o Distributed Rate Limiting. CS2006-0870, October 24, 2006



o Efficient Cooperative Scheduling in 802.11 Wireless Networks. CS2005-0830, July 7, 2005

• Sohn, T.

o Place-Its: Location-Based Reminders on Mobile Phones. CS2005-0820, March 23, 2005

o Evaluating Location Based Reminders. CS2005-0826, May 18, 2005

• Star, L. 

o The ActiveClass Project: Experiments in Encouraging Classroom Participation. CS2002-0703, April 24, 2002

• Star, S.

o The ActiveClass Project: Experiments in Encouraging Classroom Participation. CS2002-0715, July 8, 2002

• Steinberg, J.

o Limited Mobile Agents: A Practical Approach. CS2000-0641, December 29, 1999
o Dynamic Web Stream Customizers. CS2001-0691, December 14, 2001

o A Web Middleware Architecture for Dynamic Customization of Web Content for Non-Traditional Clients. CS2001-0692, December 14, 2001
o Using Network Flow Buffering to Improve Performance of Video over HTTP. CS2004-0776, January 14, 2004

• Strout, M.

o Proof of Correctness for Sparse Tiling of Gauss-Seidel. CS2001-0690, December 4, 2001
o Proof of Correctness for Sparse Tiling of Gauss-Seidel. CS2003-0741, April 1, 2003

• Su, A.

o Adaptive Performance Prediction for Distributed Data-Intensive Applications. CS1999-0619, May 18, 1999

• Su, M.

o Combining Workstations and Supercomputers to Support Grid Applications: The Parallel Tomography Experience. CS2000-0642, January 7, 2000

■ Surjihnrn, R. 

o An Efficient General-Purpose Mechanism for Data Gathering with Accuracy Requirement in Wireless Sensor Networks. CS2006-0854, April 5, 2006



u Vcrificnti 



^€52006-0853,1- 



:h 27, 2006 

o A Client-Server Oriented Algorithm for Virtually Synchronous Group Membership in WANs. CS1999-0623, July 7, 1999
o Optimistic Virtual Synchrony. CS1999-0634, November 9, 1999
• Tali, K. 

o Interaction of Virtual Machine with the Operating System. CS2002-0728, December 2, 2002
- ~ • ■ " it. Son State To Improve DHT Routing CS2006-0S62, July 27, 2006 
' — ....... — 3, July 27, 2006 



• Teixeira, R.
■ TowerTB! 

o Dneklttit Uinicnl hierarchies: A comparison oflwo algorithms for reconciling keyword structures. CS2001-0669. April 26, 2001 
• Truong, T.



o The ActiveClass Project: Experiments in Encouraging Classroom Participation. CS2002-0703, April 24, 2002

o ActiveCampus - Sustaining Educational Communities through Mobile Technology. CS2002-0704, April 24, 2002

o ActiveCampus - Sustaining Educational Communities through Mobile Technology. CS2002-0714, July 8, 2002

o The ActiveClass Project: Experiments in Encouraging Classroom Participation. CS2002-0715, July 8, 2002

o Using Mobile Technology to Create Opportunistic Interactions on a University Campus. CS2002-0724, October 16, 2002

o ActiveCampus - Experiments in Community-Oriented Ubiquitous Computing. CS2003-0750, June 24, 2003

• Tuck, N.

o Code Pointer Protection From Buffer Overflow Through Targeted Hardware Encryption. CS2003-0774, December 1, 2003

• Tucker, P.

o Minimax Programs and Bitonic Column Matrices. CS1999-0624, June 17, 1999
o Min Cuts Without Path Packing. CS1999-0625, June 17, 1999

• Tullsen, D.

o Pointer Cache Assisted Speculative Precomputation. CS2002-0712, June 23, 2002

o A New Direction in Tree Based Search Engine Architectures Using Balanced Single Port Memories. CS2004-0799, October 15, 2004

• Tune, E. 

o Critical-Path Aware Processor Architectures. CS2004-0808, December 16, 2004

• Tuttle, C.

o Distributed Application Management Using Plush. CS2006-0864, July 31, 2006

• Uyeda, F.

o Evaluation of a High Performance Erasure Code Implementation. CS2004-0798, September 13, 2004

• Vahdat, A.

o MobiNet: A Scalable Emulation Infrastructure for Ad Hoc and Wireless Networks. CS2004-0792, June 14, 2004

o Declarative Resource Naming for Macroprogramming Wireless Networks of Embedded Systems. CS2004-0800, November 2, 2004

o The Overlay Network Content Distribution Problem. CS2005-0824, May 18, 2005

o Declarative Resource Naming for Macroprogramming Wireless Networks of Embedded Systems. CS2005-0827, May 30, 2005
o Distributed Application Management Using Plush. CS2006-0864, July 31, 2006

• Vaish, V.

o Automatic Color Calibration for Large Camera Arrays. CS2005-0821, May 11, 2005
• Vaishampayan, V.

o Peer-to-Peer Error Recovery for Hybrid Satellite-Terrestrial Networks. CS2005-0842, October 31, 2005

• VanB. 

o Usinu SfinPoinlg in Diverse Simulation Environments. CS20Q2-0727, November 16, 2002 

o A Co-Phase Matrix to Guide Simultaneous Multithreading Simulation, CS2003-0771, October 28, 2003
o Efficient Sampling Startup for Uniprocessor and Simultaneous Multithreading Simulation, CS2004-0803, November 28, 2004
o Page-Based Transactional Memory to Provide Fast Virtual Transactions, CS2005-0848, December 18, 2005

• Varghese, G. 

o Scalable Measurement: Finding some Elephants in a Swarm of Ants, CS2001-0666, February 12, 2001
o Fast Content-Based Packet Handling for Intrusion Detection, CS2001-0670, May 7, 2001
o Aggregated Bit Vector Search Algorithms for Packet Filter Lookups, CS2001-0673, June 3, 2001
o Automatically Downloading Images to Improve Web Transfer Times, CS2001-0683, September 11, 2001
o FTP-M: An FTP-like Multicast File Transfer Application, CS2001-0684, September 11, 2001
o Modifying Shortest Path Routing Protocols to Create Symmetrical Routes, CS2001-0685, September 11, 2001
o New directions in traffic measurement and accounting, CS2002-0699, February 8, 2002
o Counting the number of active flows on a high speed link, CS2002-0705, May 21, 2002
o Fast and Scalable Conflict Detection for Packet Classifiers, CS2002-0718, August 7, 2002
o Packet Classification for Core Routers: Is there an alternative to CAMs?, CS2002-0719, August 7, 2002
o HYPERCUTS: A Decision Tree Based Algorithm for Fast Packet Classification, CS2002-0730, December 12, 2002
o Packet Classification Using Multidimensional Cutting, CS2003-0736, February 7, 2003
o Bitmap algorithms for counting active flows on high speed links, CS2003-0738, March 13, 2003
o Real-time Detection of Known and Unknown Worms, CS2003-0745, May 30, 2003
o Automatically Inferring Patterns of Resource Consumption in Network Traffic, CS2003-0746, June 2, 2003
o The Measurement Manifesto, CS2003-0747, June 4, 2003
o The Impact of Address Allocation and Routing on the Structure and Implementation of Routing Tables, CS2003-0749, June 19, 2003
o Cone: Augmenting DHTs to Support Distributed Resource Discovery, CS2003-0755, July 21, 2003
o The EarlyBird System for Real-time Detection of Unknown Worms, CS2003-0761, August 4, 2003
o Code Pointer Protection From Buffer Overflow Through Targeted Hardware Encryption, CS2003-0774, December 1, 2003
o Cone: A Distributed Heap Approach to Resource Selection, CS2004-0782, March 22, 2004
o Accuracy Bounds For The Scaled Bitmap Data Structure, CS2005-0819, March 22, 2005

• Venkatesh, G.

o Page-Based Transactional Memory to Provide Fast Virtual Transactions, CS2005-0848, December 18, 2005

• Vianu, V.

o Verification of Communicating Data-Driven Web Services, CS2006-0853, March 27, 2006

• Vishwanath, K.

o Distributed Rate Limiting, CS2006-0870, October 24, 2006

• Viswanathan, K.

o Limit results on pattern entropy, CS2004-0811, December 27, 2004

• Voelker, G.

o Replication Strategies for Highly Available Peer-to-peer Storage Systems, CS2002-0726, November 6, 2002
o Interaction of Virtual Machine with the Operating System, CS2002-0728, December 2, 2002
o Whole Page Performance, CS2002-0729, December 16, 2002
o The Phoenix Recovery System: Rebuilding from the ashes of an Internet catastrophe, CS2003-0732, January 13, 2003
o Online Load Balancing and First-Hop Bandwidth Allocation in Public-Area Wireless Networks, CS2003-0748, June 10, 2003
o Cone: Augmenting DHTs to Support Distributed Resource Discovery, CS2003-0755, July 21, 2003
o A Near-Optimal Algorithm for a Locality-Maximizing Placement Problem, CS2004-0777, January 16, 2004
o Access and Mobility of Wireless PDA Users, CS2004-0780, February 9, 2004
o Cone: A Distributed Heap Approach to Resource Selection, CS2004-0782, March 22, 2004
o Network Telescopes: Technical Report, CS2004-0795, July 7, 2004
o Coping with Internet catastrophes, CS2005-0816, February 17, 2005
o Jigsaw: Solving the Puzzle of Enterprise 802.11 Analysis, CS2006-0849, February 21, 2006
o Automatic Protocol Inference: Unexpected Means of Identifying Protocols, CS2006-0850, February 21, 2006
o Shortcuts: Using Soft State To Improve DHT Routing, CS2006-0862, July 27, 2006






o Resource Reclamation in Distributed Hash Tables, CS2006-0863, July 27, 2006

• Vrable, M.

o The Overlay Network Content Distribution Problem, CS2005-0824, May 18, 2005

• Wang, G.

o Critical Points for Interactive Schema Matching, CS2004-0779, January 30, 2004
o Understanding When Location-Hiding Using Overlay Networks is Feasible, CS2004-0788, May 9, 2004

• Weigle, E.

o Peer-to-Peer Error Recovery for Hybrid Satellite-Terrestrial Networks, CS2005-0842, October 31, 2005

• Wilburn, B.

• Wills, J.

o Structure from Periodic Motion, CS2003-0767, October 10, 2003
o A Feature-based Approach for Determining Dense Long Range Correspondences, CS2003-0768, October 20, 2003

• Wolski, R. 

o Adaptive Performance Prediction for Distributed Data-Intensive Applications, CS1999-0619, May 18, 1999
o Combining Workstations and Supercomputers to Support Grid Applications: The Parallel Tomography Experience, CS2000-0642, January 7, 2000
o Application Scheduling on the Information Power Grid, CS2000-0644, January 11, 2000



° The Virtual Instrument: Support for Grid-enabled 5- 



h CS2002-0707, May 31, 2002



h-Dimensional Space, CS2006-0866, September 26, 2006

o Enhanced Design Flow and Optimizations for Multi-Project Wafers, CS2005-0823, May 14, 2005

• Yang, D.

o The Entropia Virtual Machine for Desktop Grids, CS2003-0773, October 28, 2003

• Yang, X.

o Hurwitz Interconnect Delay Evaluation - HIDE: User's Manual, CS2000-0661, November 9, 2000

• Yang, Y.

o A Multi-Round Algorithm for Scheduling Divisible Workload Applications: Analysis and Experimental Evaluation, CS2002-0721, September 26, 1
s to the Multi-Installment Algorithm: Affine Costs and Output Data Transfers, CS2003-0754, July 16, 2003



a AFST-DV: Divisible Load Scheduling ni 



o APST-DV: A P[ 



3 NP-Con 



ss of the Divisible Li 



k forSchcdul 



mtheG 



:r9, 2004 



• Yao, B.

o A Multiple Level Network Approach for Clock Skew Minimization with Proi

• Yee, B.

o Security in the Sanctuary System, CS2002-0731, December 20, 2002

• Yocum, K.

o Failure-Resilient Expectations for Federated Systems, CS2006-0865, August 28, 2006
o Distributed Rate Limiting, CS2006-0870, October 24, 2006

• Young, S. 

o Combining Workstations and Supercomputers to Support Grid Applica

• Yuan, J.

o AspectBrowser: Tool Support for Managing Dispersed Aspects, CS1999-0640, January 3, 2000
o Exploiting the Map Metaphor in a Tool for Software Evolution, CS2000-0660, September 20, 2000

• Zadrozny, B.

o Learning and Making Decisions When Costs and Probabilities are Both Unknown, CS2001-0664, January 2, 2001

• Zagorodnov, D.

o Heuristics for Scheduling Parameter Sweep Applications in Grid Environments, CS1999-0632, October 14, 1999
o Application Scheduling on the Information Power Grid, CS2000-0644, January 11, 2000



CS2005-0818, March 10, 2005
CS2003-0756, July 28, 2003



s: The Parallel Tomography 



w and Optimizations for Multi-Project Wafers, CS2005-0823, May 14, 2005
i. CS2004-0811, December 27, 2004



• Zelikovsky, A.

o Enhanced Design Flo

• Zhang, J.

o Limit results on

• Zhang, W.

o Dynamic Power Aware Packet Processing with CMP, CS2006-0852, March 21, 2006
o Event-Driven Multithreaded Dynamic Optimization, CS2006-0872, December 9, 2006

• Zhang, X.

o Group Membership and Wide-Area Master-Worker Computations, CS2002-0725, November 6, 2002
o Customizable Service State Durability for Service Oriented Architectures, CS2006-0861, July 13, 2006



o Incremental Sparse Binary Vector Similarity Search in High-Dim



e i CS2006-O 



• Zheng, Y.

o JBIG Compression Algorithms for "Dummy Fill" VLSI Layout Data, CS2002-0709, June 14, 2002

• Zhou, D.

o Verification of Communicating Data-Driven Web Services, CS2006-0853, March 27, 2006

CS2002-0717, August 7, 2002







Intel/P16136 
US Application Serial No. 10/424,356 



EXHIBIT 11

TO DECLARATION OF 
JAMES A FLIGHT 






Automatically characterizing large scale program behavior 

Full text: Pdf (1.54 MB)

Source: ACM SIGOPS Operating Systems Review archive
Volume 36, Issue 5 (December 2002) table of contents
SESSION: System performance and optimization table of contents
Pages: 45-57
Year of Publication: 2002
ISBN: 1-58113-574-2
Also published in ...

Authors: Timothy Sherwood, University of California, San Diego
         Erez Perelman, University of California, San Diego
         Greg Hamerly, University of California, San Diego
         Brad Calder, University of California, San Diego

Publisher: ACM Press, New York, NY, USA




DOI Bookmark: http://doi.acm.org/10.1145/635508.605403



ABSTRACT

Understanding program behavior is at the foundation of computer architecture and program 
optimization. Many programs have wildly different behavior on even the very largest of scales (over 
the complete execution of the program). This realization has ramifications for many architectural and 
compiler techniques, from thread scheduling, to feedback directed optimizations, to the way 
programs are simulated. However, in order to take advantage of time-varying behavior, we must
first develop the analytical tools necessary to automatically and efficiently analyze program behavior 
over large sections of execution. Our goal is to develop automatic techniques that are capable of 
finding and exploiting the Large Scale Behavior of programs (behavior seen over billions of 
instructions). The first step towards this goal is the development of a hardware independent metric 
that can concisely summarize the behavior of an arbitrary section of execution in a program. To this 
end we examine the use of Basic Block Vectors. We quantify the effectiveness of Basic Block Vectors
in capturing program behavior across several different architectural metrics, explore the large scale 
behavior of several programs, and develop a set of algorithms based on clustering capable of 
analyzing this behavior. We then demonstrate an application of this technology to automatically 
determine where to simulate for a program to help guide computer architecture research. 



REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has 
opted to expose the complete List rather than only correct and linked references. 

1 A. Ben-Dor, R. Shamir, and Z. Yakhini. Clustering gene expression patterns. Journal of 
Computational Biology, 6:281-297, 1999. 

2 Christopher M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press,
Inc., New York, NY, 1995.



Automatically Characterizing 
Large Scale Program Behavior 



Timothy Sherwood Erez Perelman Greg Hamerly Brad Calder 

Department of Computer Science and Engineering 
University of California, San Diego 

{sherwood,eperelma,ghamerly,calder}@cs.ucsd.edu



Abstract 

Understanding program behavior is at the foundation of
computer architecture and program optimization. Many programs
have wildly different behavior on even the very largest
of scales (over the complete execution of the program). This
realization has ramifications for many architectural and compiler
techniques, from thread scheduling, to feedback directed
optimizations, to the way programs are simulated. However,
in order to take advantage of time-varying behavior, we must
first develop the analytical tools necessary to automatically
and efficiently analyze program behavior over large sections
of execution.

Our goal is to develop automatic techniques that are capable
of finding and exploiting the Large Scale Behavior of
programs (behavior seen over billions of instructions). The
first step towards this goal is the development of a hardware
independent metric that can concisely summarize the behavior
of an arbitrary section of execution in a program. To
this end we examine the use of Basic Block Vectors. We
quantify the effectiveness of Basic Block Vectors in capturing
program behavior across several different architectural metrics,
explore the large scale behavior of several programs, and
develop a set of algorithms based on clustering capable of
analyzing this behavior. We then demonstrate an application of
this technology to automatically determine where to simulate
for a program to help guide computer architecture research.

1. INTRODUCTION 

Programs can have wildly different behavior over their run
time, and these behaviors can be seen even on the largest of
scales. Understanding these large scale program behaviors
can unlock many new optimizations. These range from new
thread scheduling algorithms that make use of information on
when a thread's behavior changes, to feedback directed
optimizations targeted at not only the aggregate performance
of the code but individual phases of execution, to creating
simulations that accurately model full program behavior. To
enable these optimizations, we must first develop the analytical
tools necessary to automatically and efficiently analyze



Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
ASPLOS X, 10/02, San Jose, CA, USA.
Copyright 2002 ACM 1-58113-574-2/02/0010 ...$5.00.



program behavior over large sections of execution.

In order to perform such an analysis we need to develop a 
hardware independent metric that can concisely summarize 
the behavior of an arbitrary section of execution in a program.
In [19], we presented the use of Basic Block Vectors
(BBV), which uses the structure of the program that is ex- 
ercised during execution to determine where to simulate. A 
BBV represents the code blocks executed during a given in- 
terval of execution. Our goal was to find a single continuous 
window of executed instructions that match the whole pro- 
gram's execution, so that this smaller window of execution 
can be used for simulation instead of executing the program 
to completion. Using the BBVs provided us with a hardware 
independent way of finding this small representative window. 

In this paper we examine the use of BBVs for analyzing 
large scale program behavior. We use BBVs to explore the 
large scale behavior of several programs and discover the 
ways in which common patterns, and code, repeat themselves 
over the course of execution. We quantify the effectiveness of 
basic block vectors in capturing this program behavior across 
several different architectural metrics (such as IPC, branch, 
and cache miss rates). 

In addition to this, there is a need for a way of classifying 
these repeating patterns so that this information can be used 
for optimization. We show that this problem of classifying 
sections of execution is related to the problem of cluster- 
ing from machine learning, and we develop an algorithm to 
quickly and effectively find these sections based on clustering. 
Our techniques automatically break the full execution of the 
program up into several sets, where the elements of each set 
are very similar. Once this classification is completed, anal- 
ysis and optimization can be performed on a per-set basis. 

We demonstrate an application of this cluster-based be- 
havior analysis to simulation methodology for computer ar- 
chitecture research. By making use of clustering information 
we are able to accurately capture the behavior of a whole 
program by taking simulation results from representatives of 
each cluster and weighing them appropriately. This results 
in finding a set of simulation points that when combined ac- 
curately represents the target application and input, which 
in turn allows the behavior of even very complicated programs
such as gcc to be captured with a small amount of
simulation time. We provide simulation points (points in the 
program to start execution at) for Alpha binaries of all of the 
SPEC 2000 programs. In addition, we validate these simula- 
tion points with the IPC, branch, and cache miss rates found 
for complete execution of the SPEC 2000 programs. 

The rest of the paper is laid out as follows. First, a summary
of the methodology used in this research is described
in Section 2. Section 3 presents a brief review of basic block
vectors and an in depth look into the proposed techniques
and algorithms for identifying large scale program behaviors,
and an analysis of their use on several programs. Section 4
describes how clustering can be used to analyze program
behavior, and describes the clustering methods used in detail.
Section 5 examines the use of the techniques presented in
Sections 3 and 4 on an example problem: finding where to
simulate in a program to achieve results representative of full
program behavior. Related work is discussed in Section 6,
and the techniques presented are summarized in Section 7.

2. METHODOLOGY 

In this paper we used both ATOM [21] and SimpleScalar
3.0c [3] to perform our analysis and gather our results for
the Alpha AXP ISA. ATOM is used to quickly gather profiling
information about the code executed for a program.
SimpleScalar is used to validate the phase behavior we found
when clustering our basic block profiles, showing that this
corresponds to the phase behavior in the program's performance
and architecture metrics. The baseline microarchitecture
model we simulated is detailed in Table 1. We simulate
an aggressive 8-way dynamically scheduled microprocessor
with a two level cache design. Simulation is execution-driven,
including execution down any speculative path until the
detection of a fault, TLB miss, or branch mis-prediction.

We analyzed and simulated all of the SPEC 2000 benchmarks
compiled for the Alpha ISA. The binaries we used
in this study and how they were compiled can be found at:
http://www.simplescalar.com/.

3. USING BASIC BLOCK VECTORS 

A basic block is a section of code that is executed from 
start to finish with one entry and one exit. We use the fre- 
quencies with which basic blocks are executed as the metric 
to compare different sections of the application's execution 
to one another. The intuition behind this is that the behavior
of the program at a given time is directly related to
the code it is executing during that interval, and basic block 
distributions provide us with this information. 

A program, when run for any interval of time, will execute
each basic block a certain number of times. Knowing this
information provides us with a fingerprint for that interval
of execution, and tells us where in the code the application
is spending its time. The basic idea is that knowing the basic
block distribution for two different intervals gives us two
separate fingerprints which we can then compare to find out
how similar the intervals are to one another. If the fingerprints
are similar, then the two intervals spend about the
same amount of time in the same code, and the performance
of those two intervals should be similar.

3.1 Basic Block Vector 

A Basic Block Vector (BBV) is a single dimensional array,
where there is a single element in the array for each static
basic block in the program. For the results in this paper, the
basic block vectors are collected in intervals of 100 million
instructions throughout the execution of a program. At the
end of each interval, the number of times each basic block is
entered during the interval is recorded and a new count for
each basic block begins for the next interval of 100 million
instructions. Therefore, each element in the array is the count
of how many times the corresponding basic block has been
entered during an interval of execution, multiplied by the
number of instructions in that basic block. By multiplying
in the number of instructions in each basic block we ensure
that we weigh instructions the same regardless of whether
they reside in a large or small basic block. We say that a Basic
Block Vector which was gathered by counting basic block
executions over an interval of N × 100 million instructions
is a Basic Block Vector of duration N.

Because we are not interested in the actual count of basic 
block executions for a given interval, but rather the propor- 
tions between time spent in basic blocks, a BBV is normal- 
ized by having each element divided by the sum of all the 
elements in the vector.
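
The construction described in Section 3.1 can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' tooling: the function name `basic_block_vector`, the representation of an interval as a list of entered block ids, and the `block_sizes` list are all assumptions made for the example.

```python
from collections import Counter


def basic_block_vector(entered_blocks, block_sizes):
    """Build a normalized BBV for one interval (hypothetical helper).

    entered_blocks: iterable of static basic block ids, one per block entry
                    observed during the interval (assumed input format).
    block_sizes:    block_sizes[i] is the instruction count of block i.
    """
    entry_counts = Counter(entered_blocks)
    # Weight each entry count by the block's instruction count so every
    # instruction counts the same in large and small blocks.
    weighted = [entry_counts.get(block_id, 0) * size
                for block_id, size in enumerate(block_sizes)]
    total = sum(weighted)
    if total == 0:
        # Empty interval: return an all-zero vector rather than divide by zero.
        return [0.0] * len(block_sizes)
    # Normalize so the elements sum to 1.
    return [w / total for w in weighted]


# Two static blocks of 2 and 4 instructions; block 0 entered twice, block 1 once.
example_bbv = basic_block_vector([0, 0, 1], [2, 4])
```

In this toy interval both blocks account for four executed instructions each, so the normalized vector splits evenly between them.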

3.2 Basic Block Vector Difference 

In order to find patterns in the program we must first have
some way of comparing two Basic Block Vectors. The operation
we desire takes as input two Basic Block Vectors, and
outputs a single number which tells us how close they are to 
each other. There are several ways of comparing two vectors 
to one another, such as taking the dot product or finding 
the Euclidean or Manhattan distance. In this paper we use 
both the Euclidean and Manhattan distances for comparing 
vectors. 

The Euclidean distance can be found by treating each vector
as a single point in D-dimensional space. The distance
between two points is simply the square root of the sum of
squares just as in $c^2 = a^2 + b^2$. The formula for computing the
Euclidean distance of two vectors a and b in D-dimensional
space is given by:

$$\mathrm{EuclideanDist}(a, b) = \sqrt{\sum_{i=1}^{D} (a_i - b_i)^2}$$

The Manhattan distance on the other hand is the distance
between two points if the only paths you can take are parallel
to the axes. In two dimensions this is analogous to the distance
traveled if you were to go by car through city blocks.
This has the advantage that it weighs more heavily differences
in each dimension (being closer in the x-dimension does
not get you any closer in the y-dimension). The Manhattan
distance is computed by summing the absolute value of the
element-wise subtraction of two vectors. For vectors a and b
in D-dimensional space, the distance can be computed as:

$$\mathrm{ManhattanDist}(a, b) = \sum_{i=1}^{D} |a_i - b_i|$$

Because we have normalized all of the vectors, the Manhat- 
tan distance will always be a single number between 0 and 2 
(because we normalize each BBV to sum to 1). This number 
can then be used to compare how closely related two intervals 
of execution are to one another. For the rest of this section 
we will be discussing distances in terms of Manhattan distance,
because we found that it more accurately represented
differences in our high-dimensional data. We present the Eu- 
clidean distance as it pertains to the clustering algorithms 
presented in Section 4, since it provides a more accurate rep- 
resentation for data with lower dimensions. 
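
As a rough illustration of the two distances defined in this section, the following Python sketch computes both for equal-length vectors. The function names are chosen for this example only and are not taken from the paper.

```python
import math


def manhattan_dist(a, b):
    """Sum of absolute element-wise differences of two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean_dist(a, b):
    """Square root of the sum of squared element-wise differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Two normalized BBVs that share no code: all time in block 0 vs. all in block 1.
disjoint = manhattan_dist([1.0, 0.0], [0.0, 1.0])
# Two identical normalized BBVs.
identical = manhattan_dist([0.5, 0.5], [0.5, 0.5])
```

For vectors normalized to sum to 1, the disjoint case reaches the upper bound of 2 mentioned above and the identical case gives 0.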

3.3 Basic Block Similarity Matrix 

Now that we have a method of comparing intervals of program
execution to one another, we can concentrate on
finding phase-based behavior. A phase of program behavior
can be defined in several ways. Past definitions are built
around the idea of a phase being a contiguous interval of execution
during which a measured program metric is relatively
stable. We extend this notion of a phase to include all similar
sections of execution regardless of temporal adjacency.

Instruction Cache       8k 2-way set-associative, 32 byte blocks, 1 cycle latency
Data Cache              16k 4-way set-associative, 32 byte blocks, 2 cycle latency
Unified L2 Cache        1Meg 4-way set-associative, 32 byte blocks, 20 cycle latency
Memory                  150 cycle round trip access
Branch Predictor        hybrid - 8-bit gshare w/ 8k 2-bit predictors + an 8k bimodal predictor
Out-of-Order Issue      out-of-order issue of up to 8 operations per cycle, 128 entry re-order buffer,
Mechanism               load/store queue, loads may execute when all prior store addresses are known
Architecture Registers  32 integer, 32 floating point
Functional Units        8-integer ALU, 4-load/store units, 2-FP adders, 2-integer MULT/DIV, 2-FP MULT/DIV
Virtual Memory          8K byte pages, 30 cycle fixed TLB miss latency after earlier-issued instructions complete

Table 1: Baseline Simulation Model.

A key observation from this paper is that the phase be- 
havior seen in any program metric is directly a function of 
the code being executed. Because of this we can use the 
comparison between the Basic Block Vectors as an approximate
bound on how closely related any other metrics will be.

To find how intervals of execution relate to one another we
create a Basic Block Similarity Matrix. The similarity matrix
is an upper triangular NxN matrix, where N is the number 
of intervals in the program's execution. An entry at (x, y) in 
the matrix, represents the Manhattan distance between the 
basic block vector at interval x and the basic block vector at 
interval y. 
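
A minimal Python sketch of such a matrix follows, assuming the BBVs are already normalized lists of equal length. The name `similarity_matrix` and the use of `None` for the unused lower triangle are illustrative choices for this example, not details from the paper.

```python
def similarity_matrix(bbvs):
    """Upper triangular N x N matrix of Manhattan distances between BBVs.

    bbvs: list of N normalized basic block vectors, one per interval.
    Entry [x][y] with y >= x holds the distance between intervals x and y;
    entries below the diagonal are left as None.
    """
    n = len(bbvs)
    matrix = [[None] * n for _ in range(n)]
    for x in range(n):
        for y in range(x, n):
            matrix[x][y] = sum(abs(p - q) for p, q in zip(bbvs[x], bbvs[y]))
    return matrix


# Three intervals: the first and third execute the same code.
toy = similarity_matrix([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
```

In the toy input, intervals 0 and 2 are identical and interval 1 shares no code with them, so the off-diagonal entries take the two extreme values 0 and 2.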

Figures 1 (left and right) and 4 (left) show the similarity
matrices for gzip, bzip, and gcc using the Manhattan distance.
The diagonal of the matrix represents the program's
execution over time from start to completion. The darker 
the points, the more similar the intervals are (the Manhat- 
tan distance is closer to 0), and the lighter they are the more 
different they are (the Manhattan distance is closer to 2). 

The top left corner of each graph is the start of program
execution and is the origin of the graph, (0, 0), and the bottom
right of the graph is the point (N - 1, N - 1) where N
is the number of intervals that the full program execution
was divided up into. The way to interpret the graph is to
start considering points along the diagonal axis drawn. Each
point is perfectly similar to itself, so the points directly on
the axis all are drawn dark. Starting from a given point on
the diagonal axis of the graph, you can begin to compare how
that point relates to its neighbors forward and backward in
execution by tracing horizontally or vertically. If you wish
to compare a given interval x with the interval at x + n,
you simply start at the point (x, x) on the graph and trace
horizontally to the right until you reach (x, x + n).

To examine the phase behavior of programs, let us first
examine gzip because it has behavior on such a large scale
that it is easy to see. If we examine an interval taken from 70
billion instructions into execution, in Figure 1 (left), this is
directly in the middle of a large phase shown by the triangle
block of dark color that surrounds this point. This means
that this interval is very similar to its neighbors both forward
and backward in time. We can also see that the execution at
50 billion and 90 billion instructions is also very similar to the
program behavior at 70 billion. We also note, while it may
be hard to see in a printed version, that the phase interval at
70 billion instructions is similar to the phases at intervals 10
and 30 billion, but they are not as similar as to those around
50 and 90 billion. Compare this with the IPC and data cache
miss rates for gzip shown in Figure 2. Overall, Figure 1 (left)
shows that the phase behavior seen in the similarity matrix
lines up quite closely with the behavior of the program, with
5 large phases (the first 2 being different from the last 3) each
divided by a small phase, where all of the small phases are
very similar to each other.

The similarity matrix for bzip (shown on the right of Fig- 
ure 1) is very interesting. Bzip has complicated behavior, 
with two large parts to its execution, compression and decompression.
This can readily be seen in the figure as the
large dark triangular and square patches. The interesting 
thing about bzip is that even within each of these sections of 
execution there is complex behavior. This, as will be shown 
later, makes the behavior of bzip impossible to capture using 
a small contiguous section of execution. 

A more complex case for finding phase behavior is gcc,
which is shown on the left of Figure 4. This similarity matrix
shows the results for gcc using the Manhattan distance.
The similarity matrix on the right will be explained in more
detail in Section 4.2.1. This figure shows that gcc does have
some regular behavior. It shows that, even here, there is common
code shared between sections of execution, such as the
intervals around 13 billion and 36 billion. In fact the strong
dark diagonal line cutting through the matrix indicates that
there is a good amount of repetition between offset segments of
execution. By analyzing the graph we can see that interval
x is very similar to interval (x + 23.6B) for all x.

Figures 2 and 5 show the time varying behavior of gzip
and gcc. The average IPC and data cache miss rate are shown
for each 100 million instruction interval of execution over the
complete execution of the program. The time varying results
graphically show the same phase behavior seen by looking at only
the code executed. For example, the two phases for gcc at 13
billion and 36 billion, shown to be very similar in Figure 4,
are shown to have the same IPC and data cache miss rate in
Figure 5.

4. CLUSTERING 

The basic block vectors provide a compact and representative
summary of the program's behavior for intervals of
execution. By examining the similarity between them, it is
clear that there exists a high level pattern to each program's
execution. In order to make use of this behavior we need
to start by delineating a method of finding and representing
the information. Because there are so many intervals of
execution that are similar to one another, one efficient
representation is to group the intervals together that have similar
behavior. This problem is analogous to a clustering problem.
Later, in Section 5, we demonstrate how we use the clusters
we discover to find multiple simulation points for irregular
programs or inputs like gcc. By simulating only a single
representative from each cluster, we can accurately represent the
whole program's execution.

4.1 Clustering Overview 

The goal of clustering is to divide a set of points into groups 






Figure 1: Basic block similarity matrix for the programs gzip-graphic (shown left) and bzip-graphic (shown
right). The diagonal of the matrix represents the program's execution to completion with units in billions of 
instructions. The darker the points, the more similar the intervals are (the Manhattan distance is closer to 
0), and the lighter the points the more different they are (the Manhattan distance is closer to 2). 



[Plots for Figures 2 and 3. x-axis: Instructions Executed (in Billions).]



Figure 2: (top graph) Time varying graph for gzip-graphic. The average IPC (drawn with solid line) and 
L1 data cache miss rate (drawn with dotted line) are plotted for every interval (100 million instructions of
execution) showing how these metrics vary over the program's execution. The x-axis represents the execution 
of the program over time, and the y-axis the percent of max value the metric had during execution. The 
results are non-accumulative. 

Figure 3: (bottom graph) Cluster graph for gzip-graphic. The full run of the execution is partitioned into a
set of 6 clusters. The x-axis is in instructions executed, and the graph shows for each interval of execution
(every 100 million instructions), which cluster the interval was placed into.






Figure 4: The original basic block similarity matrix for the program gcc (shown left), and the similarity matrix
for gcc-166 drawn from projected data (on right). The figure on the left uses the original basic block vectors
(each of which has over 100,000 dimensions) and uses the Manhattan distance as a method of difference
taking. The figure on the right uses projected data (down to 15 dimensions) and uses the Euclidean distance
for difference taking.




[Plots for Figures 5 and 6 not reproduced. Shared x-axis: Instructions Executed (in Billions), 0 to 40.]



Figure 5: (top graph) Time varying graph for gcc-166. The average IPC (drawn with solid line) and L1 data
cache miss rate (drawn with dotted line) are plotted for every interval (100 million instructions of execution)
showing how these metrics vary over the program's execution. The x-axis represents the execution of the
program over time, and the y-axis the percent of max value the metric had during execution. The results are 
non-accumulative. 

Figure 6: (bottom graph) Cluster graph for gcc-166. The full run of the execution is partitioned into a set of 
4 clusters. The x-axis is in instructions executed, and the graph shows for each interval of execution (every 
100 million instructions), which cluster the interval was placed into. 



such that points within each group are similar to one another
(by some metric, often distance), and points in different
groups are different from one another. This problem arises in
other fields such as computer vision [10], document classification
[22], and genomics [1], and as such it is an area of much
active research. There are many clustering algorithms and
many approaches to clustering. Classically, the two primary
clustering approaches are Partitioning and Hierarchical: 

Partitioning algorithms choose an initial solution and then
use iterative updates to find a better solution. Popular
algorithms such as k-means [14] and Gaussian Expectation-Maximization
[2, pages 59-73] are in this family. These
algorithms tend to have run time that is linear in the size of
the dataset.

Hierarchical algorithms [9] either combine together similar
points (called agglomerative clustering, and conceptually
similar to Huffman encoding), or recursively divide the
dataset into more groups (called divisive clustering). These
algorithms tend to have run time that is quadratic in the size 
of the dataset. 

4.2 Phase Finding Algorithm 

For our algorithm, we use random linear projection followed
by k-means. We choose to use the k-means clustering
algorithm, since it is a very fast and simple algorithm that
yields good results. To choose the value of k, we use the
Bayesian Information Criterion (BIC) score [11, 17]. The
following steps summarize our algorithm, and then several of 
the steps are explained in more detail: 

1. Profile the basic blocks executed in each program to 
generate the basic block vectors for every 100 million 
instructions of execution. 

2. Reduce the dimension of the BBV data to 15 dimen- 
sions using random linear projection. 

3. Try the k-means clustering algorithm on the
low-dimensional data for k values 1 to 10. Each run of
k-means produces a clustering, which is a partition of
the data into k different clusters.

4. For each clustering (k = 1...10), score the fit of the
clustering using the BIC. Choose the clustering with
the smallest k, such that its score is at least 90% as
good as the best score.

4.2.1 Random Projection 

For this clustering problem, we have to address the prob- 
lem of dimensionality. All clustering algorithms suffer from 
the so-called "curse of dimensionality", which refers to the
fact that it becomes extremely hard to cluster data as the
number of dimensions increases. For the basic block vectors,
the number of dimensions is the number of executed basic
blocks in the program, which ranges from 2,756 to 102,038
for our experimental data, and could grow into the millions
for very large programs. Another practical problem is that 
the running time of our clustering algorithm depends on the 
dimension of the data, making it slow if the dimension grows 
too large. 

Two ways of reducing the dimension of data are dimension 
selection and dimension reduction. Dimension selection sim- 
ply removes all but a small number of the dimensions of the 
data, based on a measure of goodness of each dimension for 
describing the data. However, this throws away a lot of data 
in the dimensions which are ignored. Dimension reduction 



reduces the number of dimensions by creating a new lower-dimensional
space and then projecting each data point into
the new space (where the new space's dimensions are not
directly related to the old space's dimensions). This is analogous
to taking a picture of 3 dimensional data at a random
angle and projecting it onto a screen of 2 dimensions.

For this work we choose to use random linear projection [5]
to create a new low-dimensional space into which we project
the data. This is a simple and fast technique that is very effective
at reducing the number of dimensions while retaining
the properties of the data. There are two steps to reducing
a dataset $X$ (which is a matrix of basic block vectors and is
of size $N_{intervals} \times D_{numbb}$, where $D_{numbb}$ is the number of
basic blocks in the program) down to $D_{new}$ dimensions using
random linear projection:

• Create a $D_{numbb} \times D_{new}$ projection matrix $M$ by choosing
a random value for each matrix entry between -1
and 1.

• Multiply $X$ times $M$ to obtain the new lower-dimensional
dataset $X'$ which will be of size $N_{intervals} \times D_{new}$.
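
As an illustration only, the two steps above can be sketched in plain Python. The function name `random_projection`, the fixed seed, the toy input, and the choice of a uniform distribution on [-1, 1] are assumptions made for this sketch; the text only says that each matrix entry is a random value between -1 and 1.

```python
import random


def random_projection(X, d_new, seed=0):
    """Project the rows of X (N x D) down to d_new dimensions.

    Hypothetical helper: entries of the projection matrix are drawn
    uniformly from [-1, 1], which is an assumption of this sketch.
    """
    rng = random.Random(seed)
    d_old = len(X[0])
    # Step 1: build the D x d_new projection matrix M.
    M = [[rng.uniform(-1.0, 1.0) for _ in range(d_new)] for _ in range(d_old)]
    # Step 2: compute X' = X * M, one row (interval) at a time.
    return [
        [sum(row[i] * M[i][j] for i in range(d_old)) for j in range(d_new)]
        for row in X
    ]


# Three toy "basic block vectors" over four basic blocks (made-up data).
bbvs = [
    [0.5, 0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.25, 0.75],
]
projected = random_projection(bbvs, d_new=2)
```

Because the projection is linear, identical intervals remain identical after projection, and the output has one row per interval with `d_new` columns.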

For clustering programs, we found that using $D_{new} = 15$
dimensions is sufficient to still differentiate the different
phases of execution. Figure 7 shows why we chose to project
the data down to 15 dimensions. The graph shows the number
of dimensions on the x-axis. The y-axis represents the k
value found to be best on average, when the programs were
projected down to the number of dimensions indicated by the
x-axis. The best k is determined by the k with the highest
BIC score, which is discussed in Section 4.2.3. The y-axis is
shown as a percent of the maximum k seen for each program 
so that the curve can be examined independent of the actual 
number of clusters found for each program. The results show 
that for 15 dimensions the number of clusters found begins 
to stabilize and only climbs slightly. Similar results were also 
found using a different method of finding k in [6]. 

The advantages of using linear projections are twofold. 
First, creating new vectors with a low dimension of 15 is 
extremely fast and can even be done at simulation time. Secondly,
using only 15 dimensions speeds up the k-means algorithm
significantly, and reduces the memory requirements
by several orders of magnitude over using the original basic 
block vectors. 

Figure 4 shows the similarity matrix for gcc on the left
using original BBVs, whereas the similarity matrix on the
right shows the same matrix but on the data that has been
projected down to 15 dimensions. For the reduced dimension 
data we use the Euclidean distance to measure differences, 
rather than the Manhattan distance used on the full data. 
After the projection, some information will be blurred, but 
overall the phases of execution that are very similar with full 
dimensions can still be seen to have a strong similarity with 
only 15 dimensions. 

4.2.2 K-means 

The k-means algorithm is an iterative optimization algorithm,
which executes as two phases, which are repeated to
convergence. The algorithm begins with a random assignment
of k different centers, and begins its iterative process.
The iterations are required because of the recursive nature 
of the algorithm; the cluster centers define the cluster mem- 
bership for each data point, but the data point memberships 
define the cluster centers. Each point in the data belongs to, 
and can be considered a member of, a single cluster. 






[Plot for Figure 7 not reproduced. x-axis: Number of Dimensions.]

Figure 7: Motivation for random projection down 
to 15 dimensions (D=15). The x-axis is the number
of dimensions of the projection, and the y-axis is
the percent of the max number of clusters found for
each program averaged over all SPEC programs. The
results show that as you decrease the number of di- 
mensions too far (the lowest point is two dimensions) 
the true clusters become collapsed on one another, 
and the algorithm cannot find as many clusters. By 
D=15 most of this effect has gone. 

We initialize the k cluster centers by choosing k random
points from the data to be clustered. After initialization, the
k-means algorithm proceeds in two phases which are repeated
until convergence: 

• For each data point being clustered, compare its distance
to each of the k cluster centers and assign it to
(make it a member of) the cluster to which it is the 
closest. 

• For each cluster center, change its position to the cen- 
troid of all of the points in its cluster (from the mem- 
berships just computed). The centroid is computed as 
the average of all the data points in the cluster. 

This process is iterated until membership (and hence the
cluster centers) ceases to change between iterations. At this point
the algorithm terminates, and the output is a set of final cluster
centers and a mapping of each point to the cluster that
it belongs to. Since we have projected the data down to 15
dimensions, we can quickly generate the clusters for k-means
with k from 1 to 10. There are efficient algorithms for
comparing the clusterings that are formed for these
different values of k; choosing one that is good but still
uses a small value for k is the next problem.
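
The two alternating phases can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `kmeans` and `sq_dist`, the fixed seed, the iteration cap, and the rule of leaving an empty cluster's center unchanged are all choices made for this sketch.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, seed=0, max_iters=100):
    """Return (centers, labels) after iterating the two phases to convergence."""
    rng = random.Random(seed)
    # Initialization: choose k random data points as the starting centers.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iters):
        # Phase 1: assign every point to its closest center.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:
            break  # memberships stopped changing, so the centers have too
        labels = new_labels
        # Phase 2: move each center to the centroid of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # assumption: an empty cluster keeps its old center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels


# Two well-separated toy groups (made-up data).
data = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centers, labels = kmeans(data, k=2)
```

On this toy input the two nearby pairs end up in separate clusters regardless of which two points are drawn as initial centers.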

4.2.3 Bayesian Information Criterion 

To compare and evaluate the different clusters formed for
different k, we use the Bayesian Information Criterion (BIC)
as a measure of the "goodness of fit" of a clustering to a
dataset. More formally, the BIC is an approximation to the
probability of the clustering given the data that has been
clustered. Thus, the larger the BIC score, the higher the
probability that the clustering being scored is a "good fit" to
the data being clustered. We use the BIC formulation given
in [17] for clustering with k-means; however, other formulations
of the BIC could also be used.



More formally, the BIC score is a penalized likelihood.
There are two terms in the BIC: the likelihood and the penalty.
The likelihood is a measure of how well the clustering models
the data. To get the likelihood, each cluster is considered to
be produced by a spherical Gaussian distribution, and the
likelihood of the data in a cluster is the product of the probabilities
of each point in the cluster given by the Gaussian.
The likelihood for the whole dataset is just the product of the
likelihoods for all clusters. However, the likelihood tends to
increase without bound as more clusters are added. Therefore
the second term is a penalty that offsets the likelihood
growth based on the number of clusters. The BIC is formulated
as

$$\mathrm{BIC}(D, k) = l(D|k) - \frac{p_j}{2} \log(R)$$

where $l(D|k)$ is the likelihood, $R$ is the number of points in
the data, and $p_j$ is the number of parameters to estimate,
which is $(k - 1) + dk + 1$ for $(k - 1)$ cluster probabilities,
$k$ cluster center estimates which each require $d$ dimensions,
and 1 variance estimate. To compute $l(D|k)$ we use

$$l(D|k) = \sum_{i=1}^{k} \left[ -\frac{R_i}{2} \log(2\pi) - \frac{R_i d}{2} \log(\sigma^2) - \frac{R_i - 1}{2} + R_i \log(R_i / R) \right]$$

where $R_i$ is the number of points in the $i$th cluster, and $\sigma^2$
is the average variance of the Euclidean distance from each
point to its cluster center.
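
A rough sketch of how such a score might be computed is given below. It assumes the BIC takes the penalized log-likelihood form described above, with penalty (p_j / 2) log R. The function name `bic_score` and the variance estimate (total squared distance divided by R - k) are assumptions of this sketch; the text only describes the variance as an average over distances to cluster centers.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def bic_score(points, centers, labels):
    """Penalized log-likelihood of a clustering (hypothetical helper)."""
    R = len(points)        # number of points
    d = len(points[0])     # number of dimensions
    k = len(centers)       # number of clusters
    sse = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    # Assumption: pooled variance estimate; requires R > k and sse > 0.
    sigma2 = sse / (R - k)
    loglik = 0.0
    for c in range(k):
        Ri = labels.count(c)
        if Ri == 0:
            continue  # an empty cluster contributes nothing
        loglik += (
            -Ri / 2 * math.log(2 * math.pi)
            - Ri * d / 2 * math.log(sigma2)
            - (Ri - 1) / 2
            + Ri * math.log(Ri / R)
        )
    p_j = (k - 1) + d * k + 1  # parameters to estimate
    return loglik - p_j / 2 * math.log(R)


# Two tight, well-separated toy groups (made-up data).
pts = [[0, 0], [0, 1], [1, 0], [1, 1], [10, 10], [10, 11], [11, 10], [11, 11]]
bic_k1 = bic_score(pts, [[5.5, 5.5]], [0] * 8)
bic_k2 = bic_score(pts, [[0.5, 0.5], [10.5, 10.5]], [0, 0, 0, 0, 1, 1, 1, 1])
```

For this toy data the two-cluster description scores higher than the one-cluster description, which is the behavior the penalized likelihood is meant to capture.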

For a given program and inputs, the BIC score is calculated 
for each k-means clustering, for k from 1 to N. We then choose 
the clustering that achieves a BIC score that is at least 90% 
of the spread between the largest and smallest BIC score 
that the algorithm has seen. Figure 8 shows the benefit of 
choosing a BIC with a high value and its relationship with the 
variance in IPC seen for that cluster. The y-axis shows the 
percent of IPC variance seen for a given clustering, and the 
corresponding BIC score the clustering received. Each point 
on the graph represents the average or max IPC variance for 
all points in the range of ±5% of the BIC score shown. The 
results show that picking clusterings that represent greater 
than 80% of the BIC score resulted in an IPC variance of less 
than 20% on average. The IPC variance was computed as the 
weighted sum of the IPC variance for each cluster, where the 
weight for a cluster is the number of points in that cluster. 
The IPC variance for each cluster is simply the variance of 
the IPC for all the points in that cluster. 
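
The selection rule described above (take the smallest k whose score reaches at least 90% of the spread between the smallest and largest BIC scores seen) can be sketched as below. The function name `choose_k` and the example scores are hypothetical and are not taken from the paper's data.

```python
def choose_k(bic_by_k, threshold=0.9):
    """Pick the smallest k whose BIC covers `threshold` of the observed spread."""
    lo = min(bic_by_k.values())
    hi = max(bic_by_k.values())
    cutoff = lo + threshold * (hi - lo)
    return min(k for k, score in bic_by_k.items() if score >= cutoff)


# Made-up BIC scores for k = 1..5; larger is better.
scores = {1: -100.0, 2: -40.0, 3: -12.0, 4: -10.0, 5: -11.0}
best_k = choose_k(scores)
```

Here the spread runs from -100 to -10, so the cutoff is about -19; k = 3 is the smallest clustering that clears it even though k = 4 has the single best score.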

4.3 Clusters and Phase Behavior 

Figures 3 and 6 show the 6 clusters formed for gzip and 
the 4 clusters formed for gcc. The X-axis corresponds to 
the execution of the program in billions of instructions, and 
each interval (each of 100 million instructions) is tagged to 
be in one of the N clusters (labeled on the Y-axis). These 
figures, just as for Figures 1 and 4, show the execution of the 
programs to completion. 

For gzip, the full run of the execution is partitioned into 
a set of 6 clusters. Looking to Figure 1 (left) for comparison,
we see that the cluster behavior captured by our tool lines up
quite closely with the behavior of the program. The majority
of the points are contained by clusters 1, 2, 3 and 6. Clusters
1 and 2 represent the large sections of execution which are 
similar to one another. Clusters 3 and 6 capture the smaller 
phases which lie in between these large phases, while cluster 
5 contains a small subset of the larger phases, and cluster 4 
represents the initialization phase. 




[Plot for Figure 8 not reproduced. x-axis: Percent BIC, 20% to 100%.]

Figure 8: Plot of average IPC variance and max IPC 
variance versus the BIC. These results indicate that 
for our data, a clustering found to have a BIC score 
greater than 80% will have, on average, an IPC variance
of less than 0.2.

In the cluster graph for gcc, shown in Figure 6, the run
is now partitioned into 4 different clusters. Looking to Fig- 
ure 4 for comparison, we see that even the more complicated 
behavior of gcc is captured correctly by our tool. Clusters 
2 and 4 correspond to the dark boxes shown parallel to the 
diagonal axis. It should also be noted that the projection 
does introduce some degree of error into the clustering. For 
example, the first group of points in cluster 2 are not really 
that similar to the other points in the cluster. Comparing
the two similarity matrices in Figure 4 shows the introduction
of a dark band at (0,30) on the graph which was not in
the original (un-projected) data. Despite these small errors,
the clustering is still very good, and the impact of any such 
errors will be minimized in the next section. 

5. FINDING SIMULATION POINTS 

Modern computer architecture research relies heavily on 
cycle accurate simulation to help evaluate new architectural 
features. While the performance of processors continues to 
grow exponentially, the amount of complexity within a processor
continues to grow at an even faster rate. With each
generation of processor more transistors are added, and more 
things are done in parallel on chip in a given cycle while at 
the same time cycle times continue to decrease. This grow- 
ing gap between speed and complexity means that the time 
to simulate a constant amount of processor time is growing. 
It is already to the point that executing programs fully to 
completion in a detailed simulator is no longer feasible for 
architectural studies. Since detailed simulation takes a great 
deal of processing power, only a small subset of a whole pro- 
gram can be simulated. 

SimpleScalar [3], one of the faster cycle-level simulators, 
can simulate around 400 million instructions per hour. Un- 
fortunately many of the new SPEC 2000 programs execute 
for 300 billion instructions or more. At 400 million instruc- 
tions per hour this will take approximately 1 month of CPU 
time. 
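
The one-month estimate follows from simple arithmetic, shown here as a short check. The variable names are illustrative only.

```python
# Figures quoted in the text above.
instructions_to_simulate = 300_000_000_000  # 300 billion instructions
instructions_per_hour = 400_000_000         # 400 million instructions per hour

hours = instructions_to_simulate / instructions_per_hour
days = hours / 24

print(hours, days)
```

This gives 750 hours, or a little over 31 days of CPU time for a single run.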

Because it is only feasible to execute a small portion of 
the program, it is very important that the section simulated 
is an accurate representation of the program's behavior as a 



whole. The basic block vector and cluster analysis presented
in Sections 3 and 4 will allow us to make sure that this is the case.



5.1 Single Simulation Points 

In [19], we used basic block vectors to automatically find a 
single simulation point to potentially represent the complete 
execution of a program. A Simulation Point is a starting 
simulation place (in number of instructions executed from the
start of execution) in a program's execution derived from our 
analysis. That algorithm creates a target basic block vector, 
which is a BBV that represents the complete execution of 
the program. The Manhattan distance between each interval 
BBV and the target BBV is computed. The BBV with the 
lowest Manhattan distance represents the single simulation 
point that executes the code closest to the complete execution 
of the program. This approach is used to calculate the long 
single simulation points (LongSP) described below. 
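
A minimal sketch of this target-vector approach is shown below. It assumes that each interval's BBV is normalized so that its entries sum to 1 (which is consistent with Manhattan distances ranging from 0 to 2), and that the target BBV is the normalized sum of the per-interval counts. Both are assumptions of this sketch, as are the function names and the toy counts.

```python
def normalize(counts):
    """Scale a vector of non-negative counts so that its entries sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def closest_to_target(interval_counts):
    """Index of the interval whose BBV is nearest the whole-run target BBV."""
    # Target BBV: basic block counts accumulated over the complete execution.
    target = normalize([sum(col) for col in zip(*interval_counts)])
    bbvs = [normalize(c) for c in interval_counts]
    return min(range(len(bbvs)), key=lambda i: manhattan(bbvs[i], target))


# Made-up counts for three intervals over three basic blocks.
counts = [[8, 0, 0], [4, 4, 0], [0, 4, 4]]
chosen = closest_to_target(counts)
```

In this toy case the second interval (index 1) mixes the blocks in proportions closest to the whole run, so it is selected.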

In comparison, the single simulation point results in this 
paper are calculated by choosing the BBV that has the small- 
est Euclidean distance from the centroid of the whole dataset 
in the 15-dimensional space, a method which we find superior
to the original method. The 15-dimensional centroid is
formed by taking the average of each dimension over all in- 
tervals in the cluster. 
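
The centroid-based choice can be sketched as follows. The helper names and the two-dimensional toy data are assumptions made for illustration; the paper works in the 15-dimensional projected space.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Per-dimension average of a list of equal-length vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def single_simulation_point(projected):
    """Index of the interval closest to the centroid of all intervals."""
    center = centroid(projected)
    return min(range(len(projected)), key=lambda i: sq_dist(projected[i], center))


# Made-up projected intervals (2 dimensions instead of 15 for brevity).
intervals = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [9.0, 9.0]]
sp = single_simulation_point(intervals)
start_instruction = sp * 100_000_000  # each interval is 100 million instructions
```

Minimizing the squared distance selects the same interval as minimizing the Euclidean distance, so the square root is omitted.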

Figure 9 shows the IPC estimated by executing only a 
single interval, all 100 million instructions long but chosen 
by different methods, for all SPEC 2000 programs. This
is shown in comparison to the IPC found by executing the 
program to completion. The results are from SimpleScalar 
using the architecture model described in Section 2, and all 
fast forwarding is done so that all of the architecture struc- 
tures are completely warmed up when starting simulation (no 
cold-start effect). 

The first bar, labeled none, is the IPC found when exe- 
cuting only the first 100 million instructions from the start 
of execution (without any fast forwarding). The second bar, 
FF-Billion, shows the results after blindly fast forwarding 1
billion instructions before starting simulation. The third bar,
SimPoint, shows the IPC using our single simulation point
analysis described above, and the last bar shows the IPC of 
simulating the program to completion (labeled Full). Be- 
cause these are actual IPC values, values which are closer to 
the Full bar are better. 

The results in Figure 9 show that the single simulation
points are very close to the actual full execution of the program,
especially when compared against the ad-hoc techniques.
Starting simulation at the start of the program results
in an average error of 210%, whereas blindly fast forwarding
results in an average 80% IPC error. Using our single
simulation point analysis we reduce the average IPC error to
18%. These results show that it is possible to reasonably capture
the behavior of most programs using a very small
slice of execution.

Table 2 shows the actual simulation points chosen along 
with the program counter (PC) and procedure name corre- 
sponding to the start of the interval. If an input is not at- 
tached to the program name, then the default ref input was 
used. Columns 2 through 4 are in terms of the number of 
intervals (each 100 million instructions long). The first column
is the number of instructions executed by the program,
on the specific input, when run to completion. The second 
column shows the end of initialization phase calculated as de- 
scribed in [19]. The third column shows the single simulation 
point automatically chosen as described above. This simu- 







Figure 9: Simulation results starting simulation at the start of the program (none), blindly fast forwarding 1
billion instructions, using a single simulation point, and the IPC of the full execution of the program. 



lation point is used to fast forward to the point of desired 
execution. Some simulators, debuggers, or tracing environ- 
ments (e.g., gdb) provide the ability to fast forward based 
upon a program PC, and the number of times that PC was 
executed. We therefore provide the instruction PC for the
start of the simulation point, the procedure that PC occurred 
in, and the number of times that PC has to be executed in 
order to arrive at the desired simulation point. 

These results show that a single simulation point can be 
accurate for many programs, but there is still a significant 
amount of error for programs like bzip, gzip and gcc. This 
occurs because there are many different phases of execution 
in these programs, and a single simulation point will not ac- 
curately represent all of the different phases. To address this, 
we used our clustering analysis to find multiple simulation 
points to accurately capture these programs' behavior, which
we describe next. 



5.2 Multiple Simulation Points 

To support multiple simulation points, the simulator can 
be run from start to stop, only performing detailed simu- 
lation on the selected intervals. Or the simulation can be 
broken down into N simulations, where N is the number of 
clusters found via analysis, and each simulation is run sepa- 
rately. This has the further benefit of breaking the simulation 
down into parallel components that can be distributed across 
many processors. This is the methodology we use in our sim- 
ulator. For both cases results from the separate simulation 
points need to be weighed and combined to arrive at overall 
performance for the program [4]. Care must be taken to com- 
bine statistics correctly (simply averaging will give incorrect 
results for statistics such as rates). 
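
To illustrate why plain averaging can mislead for rates, the sketch below combines a miss rate from two simulation points. The text does not spell out the exact weighting procedure, so this is one plausible reading rather than the authors' method: weight the event counts and the opportunity counts separately, then divide. All names and numbers are made up.

```python
def combine_rate(samples):
    """Combine (weight, events, opportunities) triples into one overall rate."""
    weighted_events = sum(w * e for w, e, _ in samples)
    weighted_opportunities = sum(w * o for w, _, o in samples)
    return weighted_events / weighted_opportunities


def naive_average(samples):
    """Unweighted mean of the per-sample rates (shown only for contrast)."""
    return sum(e / o for _, e, o in samples) / len(samples)


# (cluster weight, cache misses, cache accesses) for two simulation points.
samples = [(0.9, 10, 1_000), (0.1, 50_000, 100_000)]
combined = combine_rate(samples)
naive = naive_average(samples)
```

With these numbers the naive average is 0.255 while the weighted combination is about 0.46, because the second point performs far more accesses per interval than the first.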

Knowing the clustering alone is not sufficient to enable 
multiple point simulation because the cluster centers do not 
correspond to actual intervals of execution. Instead, we must 
first pick a representative for each cluster that will be used to 
approximate the behavior of the full cluster. In order to
pick this representative, we choose for each cluster the actual 
interval that is closest to the center (centroid) of the cluster. 



In addition to this, we weigh any use of this representative by
the size of the cluster it is representing. If a cluster has only
one point, its representative will only have a small impact
on the overall outcome of the program.
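
Picking one representative per cluster and weighting it by cluster size can be sketched as follows. The function names, the tie-breaking rule (the first member wins), and the toy data are assumptions of this sketch.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Per-dimension average of a list of equal-length vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def simulation_points(points, labels, k):
    """Return (interval index, weight) pairs, one per non-empty cluster."""
    result = []
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if not members:
            continue
        center = centroid([points[i] for i in members])
        # Representative: the actual interval closest to the cluster centroid.
        rep = min(members, key=lambda i: sq_dist(points[i], center))
        # Weight: fraction of all intervals that fall in this cluster.
        result.append((rep, len(members) / len(points)))
    return result


# Made-up projected intervals and cluster labels.
pts = [[0, 0], [0, 2], [0, 1], [10, 10], [10, 12]]
labs = [0, 0, 0, 1, 1]
sim_points = simulation_points(pts, labs, k=2)
```

In this example the first cluster is represented by interval 2 with weight 0.6 and the second by interval 3 with weight 0.4.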

Table 2 shows the multiple simulation points found for all 
of the SPEC 2000 benchmarks. For these results we lim- 
ited the number of clusters to be at most six for all but the 
most complex programs. This was done in order to limit the
number of simulation points, which also limits the amount of 
warmup time needed to perform the overall simulation. The 
cluster formation algorithm in Section 4 takes as an input 
parameter the max number of clusters to be allowed. Each 
simulation point contains two numbers. The first number is 
the location of the simulation point in 100s of millions of instructions.
The second number in parentheses is the weight
for that simulation point, which is used to create an overall 
combined metric. Each simulation point corresponds to 100 
million instructions. 

Figure 10 shows the IPC results for multiple simulation 
points. The first bar shows our single simulation points sim- 
ulating for 100 million instructions. The second bar LongSP 
chooses a single simulation point, but the length of simula- 
tion is identical to the length used for multiple simulation 
points (which may go up to 1 billion instructions). This is 
to provide a fair comparison between the single simulation 
points and multiple. The Multiple bar shows results using 
the multiple simulation points, and the final bar is IPC for 
full simulation. As in Figure 9, the closer the bar is to Full, 
the better. 

The results show that the average IPC error rate is re- 
duced to 3% using multiple simulation points, which is down 
from 17% using the long single simulation point. This is sig- 
nificantly lower than the average 80% error seen for blindly 
fast forwarding. The benefits can be most clearly seen in 
the programs bzip, gcc, ammp, and galgel. The reason that
the long contiguous simulation points do not do much better 
is that they are constrained to only sample at one place in 
the program. For many programs this is sufficient, but for 
those with interesting long term behavior, such as bzip, it is 










Jtiit 


SP 


PC 


Proc Name 


Multiple SlmPoints 




"3265 


24" 




02GS34 




3D2G(13.B) 
1607(12-51 


2437(4.9) 


595(15.3) 
3112(11.5) 


106B(1.3) 
2480(2.2) 




apP lu 


223B 


3 


218D 


01B520 


but,. 


624(22.1) 
1507(14.5) 


1525(22.5) 


1956(18. B) 


2234(6.6) 


1380(15.5) 


apsl 


3479 


3 


3409 


0330QC 


dctdxf. 


2107(5.6) 


2853(14) 


1007(70.7) 


896(7.7) 
















82(42.0) 


255(41.2) 


50(15. b) 


















300(36.2) 






















168(11.7) 
519(11.6) 


1042(3.7) 
872(8.2) 


430(7.5) 
195(5,5) 


762(16.2) 
143(2) 


1435 (18.2) 


bztp2-prograin 


1249 


4 


459 


ODdddU 


sortlt 


140(11) 
1005(7) 


4GB(12.3> 


606(14) 


990(16} 
859(14.6) 


341(1:7) 


bzip2-souree 


10S3 




□78 


00d774 


qSort3 






64(29.1) 


488(7.3) 


530(8.6) 














123(25) 


510(19.7) 


664(22.7) 


1123(32.5) 




e c-n-r U *hn, E J e r 


573 


14D 


404 


04elb4 


vlewingUit 


260(6.6) 


238(23.7) 


337(20.9) 


435(35.6) 
















874(12.2) 
62(11.6) 


1292(36.7) 


463(12.2) 


336(24.1) 


3(3.2) 














1976(60.1) 


1528(2.5) 


1935(3.9) 


1398(29,2) 


348(4.3) 


fmi3d 


26B3 


192 


2542 


0e3140 


acotter-element. 


112(7) 
509(13) 


209(0.6) 


842(68.4) 


1600(11) 


47(0.1) 


galgel 


4093 


3 


2492 


02db00 


ayshtn. 


3511(5.6) 
2181(29) 


2081(11) 
2161(3.3) 


3466(11.2) 
1017(5.5) 


516(31.6) 


2141(2.7) 










□50750 




1114(8.2) 


1196(58.11 


88(12.7) 


21B9(14) 


2609(7.1) 


gcc-156 


469 












149(42.2) 


30(21.3) 


404(30,1) 


















587(17.9) 


921(10.9) 


575(14.5) 








27 


37 


lGlfdO - 




63(12.5) 


31(15,8) 




25(4.2) 














find-slnglejise. 


11B{9.2J 


41(27.5) 


102(21.4) 


9(20.6) 


57(3.B) 




620 


139 




100d54 
















1037 


158 


654 


□OScOO 




961(45.4) 


67(26. 5 J 


373(7.3) 


1(0.1) 






335 






00d2B0 




207(24.1) 


171<1B.5> 


157(16.71 


330(23.5) 


71(19.21 


gzip-prosram 


1688 








longest-match 


228(22.7) 


779(21.4) 


472(9.l) 


1410(20.4) 


594(26.4) 


gaip-rnTidom 


821 


152 


624 


00al4c 


deflate 


484(0.9) 


625(0.2) 


5B0 (51) 




200(30.9) 










□0a224 




248(14.5) 
720(2.5) 


327(13.2) 


167(17.7) 


65G(27.8) 


373(24.4) 


i 


1423 








flt-sqUnre- 


9B2(21.4) 


602(10.7) 


1370(21.4) 


45B(28) 


524(18.6) 


mcf 


618 


15 


554 


00911c 


price-autJmp] k 


26B(39.fi) 
143(3,9) 


425(11) 




468(4.5) 


316(10.B) 


mesn 


2H16 




1136 






1346(35.3) 


2806(0.7) 


398(35.3) 






mgrid 








OlGOfO 






3459(22.8) 


8Q7(2D.l) 


3110(16.3) 


247D(1G.G) 








1147 






3342(25 1) 


1771(29.8) 


5102(19.7) 


2008(19.4) 


4772(6) 


perlbmk-dlfT 


399 


56 


142 


07tB74 


regmutch 


sns 

239(31.8) 


355(62.7J 


11(0.5) 


397(0. B) 


12(3.3) 


perlbmk-make 


20 


3 


12 


08268c 


Perl_runtjps jl. 


1(5) 


20(20) 


6(75) 






perlbmk-perf 


290 




6 


08268c 


Perl-runcps-st. 


39(59.31 


207(40.71 








perlbmk-split 


HQS 


162 


451 


07fc98 


regmntch 


7U4(44.B) 


596(9.1) 


232(21.7) 






sbctrack 


4709 


250 


3044 


1G7B94 


thinBd- 


6(1.71 


1719(98.3) 


777(21.7) 


710(13.8) 


2101(17.8 




twolf 


22S8 
34G4 


3 
7 


20B0 
1067 


019130 
041094 


cnlcl_ 


1951(29.8) 
312(17) 


38(14) 
2838 11.3) 


3268(11.7) 


961(20.4) 


2054(39.5 




vartex-one 


1189 


36 


272 


06289c 


Metn-GetWord 


536(17.11 


356 23.3) 


115(8.2) 


106B(17.2) 


878(34.2 




vortex-three 


1330 


177 


565 


0336aS 


Part-Delete 


934(25.4) 
4S5(25.4) 


1129(11.4) 


96(8.9) 


47(11.1) 


586(17.8 




vortex-two 


1386 


206 


1025 


OSeBIc 


Meiii-NewRegiuri 


635(7.6i 
397(23.2) 


752(24.5) 


554(21.9) 


930(7.4) 






1122 


4 


593 


0224ec 


Setjiomipdate. 


166(25.5) 
547(27.9) 


857(21.6) 


1(0.2) 


362(12.8) 


1057(12) 


vpr-route 


840 


12 


~ 477 


025c80 


EetJietipJiead 






353(23.8) 


3(2.6) 


49D(15.7) 


wupwisc 


349G 




3238 


01dC30 




1811(43.3) 


91(8) 


. 3055t43.2) 


1524(5.4) 





Table 2: Single simulation points for SPEC 2000 benchmarks. Columns 2 through 4 are in terms of 100
million instructions executed. The length of full execution is shown, as well as the end of initialization. SP is
the single simulation point using the approach in this paper. The procedure in which the simulation point 
occurred and its PC are also shown. The last 6 digits of PC of each SimPoint is given in hex, so the address 
is formed from 120xxxxxx. Procedure names that end in "." were truncated due to space. The rest of the 
columns list the multiple simulation points found in 100s of millions. The first number is the starting place 
of the simulation point relative to the start of execution. The second number shows the weight given to the 
cluster that simulation point was taken from, and is used when weighing the final results of the simulation. 



[Plot for Figure 10 not reproduced. Legend entries include LongSP and Multiple; the x-axis lists the benchmark programs.]



Figure 10: Multiple simulation point results. Simulation results are shown for using a single simulation point 
simulating for 100 million instructions, LongSP chooses a single simulation point simulating for the same 
length of execution as the multiple point simulation, simulation using multiple simulation points, and the full 
execution of the program. 



impossible to approximate the full behavior. 

Figure 11 is the average over all of the floating point programs
(top graph) and integer programs (bottom graph). Errors
for IPC, branch miss rate, instruction and data cache
miss rates, and the unified L2 cache miss rate for the architecture
presented in Section 2 are shown. The errors are with
respect to these metrics for the full length of simulation using
SimpleScalar. Results are shown for starting simulation
at the start of the program (None), blindly fast forwarding a
billion instructions (FF-Billion), single simulation points of
duration 1 (SimPoint) and k (LongSP), and multiple simulation
points (Multiple).

The first thing to note is that using just a single small
simulation point performs quite well on average across all of 
the metrics when compared to blindly fast-forwarding. Even 
though a single SimPoint does well, it is clearly beaten by 
using the clustering based scheme presented in this paper 
across all of the metrics examined. One thing that stands out 
on the graphs is that the error rate of the instruction cache 
and L2 cache appear to be high (especially for the integer 
programs) despite the fact that our technique is doing quite 
well in terms of overall performance. This is due to the fact 
that we present here an arithmetic mean of the errors, and 
there are several programs that have high error rates due to 
the very small number of cache misses. If there are 10 misses 
in the whole program, and we estimate there to be 10D, that 
will result in a error of 1CX. We point to the overall IPC 
as the most important metric for evaluation as it implicitly 
weighs each oF the metrics by it's relative importance. 
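The effect of a few tiny miss counts on an arithmetic mean of errors can be illustrated with a short sketch. The program names and miss counts below are hypothetical and are not taken from the paper; the error is computed here as a relative difference, so an estimate of 100 misses against 10 actual misses (the 10X case above) shows up as 9.0, i.e. 900%.

```python
def relative_error(estimated, actual):
    """Relative difference between an estimate and the true value."""
    return abs(estimated - actual) / actual


# Hypothetical (actual, estimated) cache miss counts for three programs.
programs = {
    "prog_a": (1_000_000, 1_020_000),  # 2% off
    "prog_b": (500_000, 490_000),      # 2% off
    "prog_c": (10, 100),               # almost no misses, estimate is 10X
}

errors = {
    name: relative_error(estimated, actual)
    for name, (actual, estimated) in programs.items()
}

# The arithmetic mean is dominated by the single program with 10 misses.
mean_error = sum(errors.values()) / len(errors)

for name, err in errors.items():
    print(f"{name}: {err:.2%}")
print(f"arithmetic mean: {mean_error:.2%}")
```

Two of the three hypothetical programs are estimated to within 2%, yet the mean error is roughly 300% because of the one program whose absolute miss count is negligible.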



6. RELATED WORK 

Time Varying Behavior of Programs: In [18], we provided
a first attempt at showing the periodic patterns for all of
the SPEC 95 programs, and how these vary over time for
cache behavior, branch prediction, value prediction, address
prediction, IPC and RUU occupancy.



Training Inputs and Finding Smaller Representative Inputs:
One approach for reducing the simulation time is to use the
training or test inputs from the SPEC benchmark suite. For
many of the benchmarks, these inputs are either (1) still too
long to fully simulate, or (2) too short and place too much
emphasis on the startup and shutdown parts of the program's
execution, or (3) inaccurately estimate behavior such as cache
accesses due to decreased working set size.

KleinOsowski et al. [12] have developed a technique where
they manually reduce the input sets of programs. The input
sets were developed using a range of approaches, from
truncating the input files to modifying source code to reduce
the number of times frequent loops were traversed. For the
input sets they develop, they make sure that they have similar
results in terms of IPC, cache, and instruction

Fast Forwarding and Check-pointing: Historically researchers 
have simulated from the start of the application, but this 
usually does not represent the majority of the program's be- 
havior because it is still in the initialization phase. Recently 
researchers have started to fast-forward to a given point in 
execution, and then start their simulation from there, ide- 
ally skipping over the initialization code to an area of code 
representative of the whole. During fast-forward the simula- 
tor simply needs to act as a functional simulator, and may 
take full advantage of optimizations like direct execution. Af- 
ter the fast-forward point has been reached, the simulator 
switches to full cycle level simulation. 

After fast-forwarding, the architecture state to be simulated
is still cold, and a warmup time is needed in order to start
collecting representative results. Efficiently warming up
execution only requires references immediately preceding the
start of simulation. Haskins and Skadron [7] examined
probabilistically determining the minimum set of fast-forward
transactions that must be executed for warm up to accurately
produce state as it would have appeared had the entire
fast-forward interval been used for warm up [7]. They recently
examined using reuse analysis to determine how far before full
simulation warmup needs to occur [8].

[Figure 11 plot. Recoverable legend entries: LongSP, Multiple.]

Figure 11: Average error results for the SPEC 2000 floating point (top) and integer (bottom) benchmarks
for IPC, branch misprediction, instruction, data and unified L2 cache miss rates.

An alternative to fast forwarding is to use check-pointing
to start the simulation of a program at a specific point. With
check-pointing, code is executed to a given point in the
program and the state is saved, or checkpointed, so that other
simulation runs can start there. In this way the initialization
section can be run just one time, and there is no need to fast
forward past it each time. The architectural state (e.g.,
caches, register file, branch prediction, etc.) can either be
stored in the trace (if it is not going to change across
simulation runs) or can be warmed up in a manner similar to
that described above.

Automatically Finding Where to Simulate: Our work is
based upon the basic block distribution analysis in [19] as
described in prior sections. Recent work on finding simulation
points for data cache simulations is presented by Lafage and
Seznec [13]. They proposed a technique to gather statistics
over the complete execution of the program and use them to
choose a representative slice of the program. They evaluate
two metrics, one which captures memory spatial locality and
one which captures memory temporal locality. They further
propose to create specialized metrics such as instruction mix,
control transfer, instruction characterization, and
distribution of data dependency distances to further quantify
the behavior of both the program's full execution and the
execution of samples.

Statistical Sampling: Several different techniques have been
proposed for sampling to estimate the behavior of the program
as a whole. These techniques take a number of contiguous
execution samples, referred to as clusters in [4], across the
whole execution of the program. These clusters are spread out
throughout the execution of the program in an attempt to
provide a representative section of the application being
simulated. Conte et al. [4] formed multiple simulation points
by randomly picking intervals of execution, and then examining
how these fit to the overall execution of the program for
several architecture metrics (IPC and branch and data cache
statistics). Our work is complementary to this, where we
provide a fast and metric independent approach for picking
multiple simulation points based just on basic block vector
similarity. When an architect gets a new binary to examine,
they can use our approach to quickly find the simulation
points, and then validate these with detailed simulation in
parallel with using the binary.

Statistical Simulation: Another technique to improve
simulation time is to use statistical simulation [16]. Using
statistical simulation, the application is run once and a
synthetic trace is generated that attempts to capture the
whole program behavior. The trace captures such
characteristics as basic block size, typical register
dependencies and cache misses. This trace is then run for
sometimes as little as 50-100,000 cycles on a much faster
simulator. Nussbaum and Smith [15] also examined generating
synthetic traces and using these for simulation, and proposed
this for fast design space exploration. We believe the
techniques presented in this paper are complementary to the
techniques of Oskin et al. and Nussbaum and Smith in that more
accurate profiles can be determined using our techniques, and
instead of attempting to characterize the program as a whole
it can be characterized on a per-phase basis.

7. SUMMARY 

At the heart of computer architecture and program
optimization is the need for understanding program behavior.
As we have shown, many programs have wildly different behavior
at even the very largest of scales (over the full lifetime of
the program). While these changes in behavior are drastic,
they are not without order, even in very complex applications
such as gcc. In order to help future compiler and architecture
researchers in exploiting this large scale behavior, we have
developed a set of analytical tools that are capable of
automatically and efficiently analyzing program behavior over
large sections of execution.

The development of the analysis is founded on a hardware 
independent metric, Basic Block Vectors, that can concisely 
summarize the behavior of an arbitrary section of execution 
in a program. We showed that by using Basic Block Vec- 
tors one can capture the behavior of programs as defined by 
several architectural metrics (such as IPC, and branch and 
cache miss rates). 
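A minimal sketch of how a Basic Block Vector might be built for one interval of execution, and how two intervals might be compared, is given below. The details are assumptions made for illustration and are not stated in this section: block execution counts are weighted by a hypothetical per-block instruction count, each vector is normalized to sum to 1, and the Manhattan distance is used as the similarity measure. The block names, sizes, and traces are invented.

```python
from collections import Counter


def basic_block_vector(block_trace, block_sizes):
    """Normalized, instruction-weighted execution counts per basic block.

    block_trace: sequence of basic block IDs executed during one interval.
    block_sizes: number of instructions in each basic block (assumed known).
    """
    counts = Counter(block_trace)
    weighted = {block: n * block_sizes[block] for block, n in counts.items()}
    total = sum(weighted.values())
    return {block: w / total for block, w in weighted.items()}


def manhattan_distance(bbv_a, bbv_b):
    """Sum of absolute element-wise differences; 0 means identical vectors."""
    blocks = set(bbv_a) | set(bbv_b)
    return sum(abs(bbv_a.get(b, 0.0) - bbv_b.get(b, 0.0)) for b in blocks)


# Hypothetical program with three basic blocks of 4, 2 and 6 instructions.
block_sizes = {"A": 4, "B": 2, "C": 6}

interval_1 = basic_block_vector(["A", "B", "A", "B"], block_sizes)
interval_2 = basic_block_vector(["A", "B", "B", "A"], block_sizes)
interval_3 = basic_block_vector(["C", "C", "A"], block_sizes)

same_code = manhattan_distance(interval_1, interval_2)
different_code = manhattan_distance(interval_1, interval_3)
print(same_code, different_code)
```

Intervals 1 and 2 execute the same blocks in the same proportions, so their distance is zero regardless of ordering, while interval 3 spends most of its time in a different block and ends up far from both. No hardware metric is consulted at any point, which is the sense in which the vector is hardware independent.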

Using this framework, we examine the large scale behavior
of several complex programs like gzip, bzip, and gcc, and
find interesting patterns in their execution over time. The
behavior that we find shows that code and program behavior
repeat over time. For example, in the input we examined in
detail for gcc we see that program behavior repeats itself
every 23.6 billion instructions. Developing techniques that
automatically capture behavior on this scale is useful for
architectural, system level, and runtime optimizations. We
present an algorithm based on the identification of clusters
of basic block vectors that can find these repeating program
behaviors and group them into sets for further analysis. For
two of the programs, gzip and gcc, we show how the clustering
algorithm results line up nicely with the similarity matrix
and correlate with the time varying IPC and data cache miss
rates.
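The grouping of intervals into clusters can be sketched with a plain k-means loop. This is an illustrative stand-in only: the paragraph above does not spell out the clustering procedure, and the vectors, the choice of two clusters, and the fixed starting centroids below are all hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, initial_centroids, iterations=20):
    """Plain k-means: returns a cluster label per point and the centroids."""
    centroids = [list(c) for c in initial_centroids]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                centroids[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Hypothetical basic block vectors for six consecutive execution intervals.
intervals = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.1, 0.1, 0.8],
    [0.0, 0.2, 0.8],
    [0.9, 0.1, 0.0],
    [0.1, 0.0, 0.9],
]

labels, centroids = kmeans(intervals, [intervals[0], intervals[2]])
print(labels)
```

Reading the labels in execution order shows which intervals share a cluster, so a program that returns to earlier behavior shows up as a label that reappears later in the sequence.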

It is increasingly common for computer architects and
compiler designers to use a small section of a benchmark to
represent the whole program during the design and evaluation
of a system. This leads to the problem of finding sections of
the program's execution that will accurately represent the
behavior of the full program. We show how our clustering
analysis can be used to automatically find multiple simulation
points to reduce simulation time and to accurately model full
program behavior. We call this clustering tool for finding
single and multiple simulation points SimPoint. SimPoint,
along with additional simulation point data, can be found at:
http://www.cs.ucsd.edu/~calder/simpoint/.
For the SPEC 2000 programs, we found that starting simulation
at the start of the program results in an average error of
210% when compared to the full simulation of the program,
whereas blindly fast forwarding resulted in an average 80%
IPC error. Using a single simulation point found using our
basic block vector analysis resulted in an average 17% IPC
error. When using the clustering algorithm to create multiple
simulation points we saw an average IPC error of 3%.
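How results from multiple simulation points might be combined into one estimate can be sketched as a weighted average, using the per-cluster weights mentioned with the simulation point table earlier in the paper. The weights, per-point metric values, and full-run value below are hypothetical, and the simple weighted mean is an assumption about how the weighting is applied.

```python
def weighted_estimate(sim_points):
    """Combine per-simulation-point results into one whole-program estimate.

    sim_points: list of (weight, metric) pairs, where weight is the share of
    execution represented by the cluster the point was taken from.
    """
    total_weight = sum(weight for weight, _ in sim_points)
    return sum(weight * metric for weight, metric in sim_points) / total_weight


# Hypothetical simulation points: (cluster weight, metric measured there).
sim_points = [(0.5, 1.2), (0.3, 0.8), (0.2, 2.0)]

estimate = weighted_estimate(sim_points)

# Hypothetical value of the same metric from simulating the full program.
full_run = 1.20
error = abs(estimate - full_run) / full_run
print(f"estimate={estimate:.3f} error={error:.1%}")
```

Dividing by the total weight keeps the estimate meaningful even if the supplied weights do not sum exactly to 1.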

Automatically identifying the phase behavior using
clustering is beneficial for architecture, compiler, and
operating system optimizations. To this end, we have used the
notion of basic block vectors and a random projection to
create an efficient technique for identifying phases
on-the-fly [20], which can be efficiently implemented in
hardware or software. Besides identifying phases, this
approach can predict not only when a phase change is about to
occur, but to which phase it is about to transition. We
believe that using phase information can lead to new compiler
optimizations with code tailored to different phases of
execution, multi-threaded architecture scheduling, power
management, and other resource distribution problems
controlled by software, hardware or the operating system.

Acknowledgments 

We would like to thank Suleyman Sair and Chris Weaver for
their assistance with SimpleScalar, as well as Mark Oskin and
the anonymous reviewers for providing helpful comments on
this paper. This work was funded in part by DARPA/ITO
under contract number DABT63-98-C-0045 and NSF CAREER
grant No. CCR-9733278.

8. REFERENCES 

[1] A. Ben-Dor, R. Shamir, and Z. Yakhini. Clustering gene
expression patterns. Journal of Computational Biology,
6:281-297, 1999.

[2] C. M. Bishop. Neural Networks for Pattern Recognition.
Clarendon Press, Oxford, 1995.

[3] D. C. Burger and T. M. Austin. The SimpleScalar tool set,
version 2.0. Technical Report CS-TR-97-1342, University of
Wisconsin, Madison, June 1997.

[4] T. M. Conte, M. A. Hirsch, and K. N. Menezes. Reducing
state loss for effective trace sampling of superscalar
processors. In Proceedings of the 1996 International
Conference on Computer Design (ICCD), October 1996.

[5] S. Dasgupta. Experiments with random projection. In
Uncertainty in Artificial Intelligence: Proceedings of the
Sixteenth Conference (UAI-2000), pages 143-151, San
Francisco, CA, 2000. Morgan Kaufmann Publishers.

[6] G. Hamerly and C. Elkan. Learning the k in k-means.
Technical Report CS2002-0716, University of California, San
Diego, 2002.

[7] J. Haskins and K. Skadron. Minimal subset evaluation:
Rapid warm-up for simulated hardware state. In Proceedings
of the 2001 International Conference on Computer Design,
September 2001.

[8] J. Haskins and K. Skadron. Memory reference reuse latency:
Accelerating sampled microarchitecture simulations.
Technical Report CS-2002-19, U of Virginia, July 2002.

[9] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering:
a review. ACM Computing Surveys, 31(3):264-323, 1999.

[10] J.-M. Jolion, P. Meer, and S. Bataouche. Robust clustering
with applications in computer vision. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 13(8):791-802,
1991.

[11] R. E. Kass and L. Wasserman. A reference Bayesian test for
nested hypotheses and its relationship to the Schwarz
criterion. Journal of the American Statistical Association,
90(431):928-934, 1995.

[12] A. KleinOsowski, J. Flynn, N. Meares, and D. Lilja.
Adapting the SPEC 2000 benchmark suite for
simulation-based computer architecture research. In
Proceedings of the International Conference on Computer
Design, September 2000.

[13] T. Lafage and A. Seznec. Choosing representative slices of
program execution for microarchitecture simulations: A
preliminary application to the data stream. In Workload
Characterization of Emerging Applications, Kluwer
Academic Publishers, September 2000.

[14] J. MacQueen. Some methods for classification and analysis
of multivariate observations. In L. M. LeCam and
J. Neyman, editors, Proceedings of the Fifth Berkeley
Symposium on Mathematical Statistics and Probability,
volume 1, pages 281-297, Berkeley, CA, 1967. University of
California Press.

[15] S. Nussbaum and J. E. Smith. Modeling superscalar
processors via statistical simulation. In International
Conference on Parallel Architectures and Compilation
Techniques, September 2001.

[16] M. Oskin, F. T. Chong, and M. Farrens. HLS: Combining
statistical and symbolic simulation to guide microprocessor
designs. In 27th Annual International Symposium on
Computer Architecture, June 2000.

[17] D. Pelleg and A. Moore. X-means: Extending K-means with
efficient estimation of the number of clusters. In Proceedings
of the 17th International Conf. on Machine Learning, pages
727-734. Morgan Kaufmann, San Francisco, CA, 2000.

[18] T. Sherwood and B. Calder. Time varying behavior of
programs. Technical Report UCSD-CS99-630, UC San
Diego, August 1999.

[19] T. Sherwood, E. Perelman, and B. Calder. Basic block
distribution analysis to find periodic behavior and
simulation points in applications. In International
Conference on Parallel Architectures and Compilation
Techniques, September 2001.

[20] T. Sherwood, S. Sair, and B. Calder. Phase tracking and
prediction. Technical Report CS2002-0710, UC San Diego,
June 2002.

[21] A. Srivastava and A. Eustace. ATOM: A system for building
customized program analysis tools. In Proceedings of the
Conference on Programming Language Design and
Implementation, pages 196-205. ACM, 1994.

[22] O. Zamir and O. Etzioni. Web document clustering: A
feasibility demonstration. In Research and Development in
Information Retrieval, pages 46-54, 1998.



13 



Intel/P16136 
US Application Serial No. 10/424,356 



EXHIBIT 12 

TO DECLARATION OF 
JAMES A FLIGHT 




Intel/P16136 
US Application Serial No. 10/424,356 



EXHIBIT 13 
TO DECLARATION OF 
JAMES A FLIGHT 






Intel/P16136 
US Application Serial No. 10/424,356 



EXHIBIT 14 

TO DECLARATION OF 
JAMES A FLIGHT 






Intel/P16136 
US Application Serial No. 10/424,356 



EXHIBIT 15 
TO DECLARATION OF 
JAMES A FLIGHT 



James A. Flight 



From: James A. Flight
Sent: Saturday, July 07, 2007 1:45 PM
To: 'calder@cs.ucsd.edu'
Subject: Request for citation assistance
Attachments: CS2002-0710.ps; CS2002-0710 (2).pdf
Signed By: jflight@hfzlaw.com
Sensitivity: Confidential



Brad, 



We are trying to determine when the full text of your article "Phase tracking and Prediction" was available to the public. 
On its face, it appears to have been published in June of 2003. However, this link 

(http://www.cse.ucsd.edu/Dienst/UI/2.0/Describe/ncstrl.ucsd_cse/CS2002-0710) also available through here 
(http://www.cs.ucsd.edu/Dienst/UI/2.0/ListAuthors/S?authority=ncstrl.ucsd_cse) makes it appear that the article may have 
been available on request to you as early as June of 2002 (see the postscript file attached as a PDF hereto for your 
convenience). However, following the link in the Postscript file leads to this list 
(http://www-cse.ucsd.edu/users/calder/papers.html), which identifies the 2003 publication, and no 2002 publication. 

In view of the foregoing, can you help me understand what, if anything, was available in 2002? I suspect it was merely 
the abstract, but perhaps you had the draft ready and were making it available a year before its official 2003 publication? 

Thanks, in advance, for any assistance you are able to provide. 



ZIMMERMAN™ 

150 South Wacker Drive, Suite 2100 
Chicago, Illinois 60606 

(312) 580-1034 (Direct) 
(312) 580-1020 (Main) 
(312) 580-9696 (Fax) 

jflight@hfzlaw.com 



Important: This electronic mail message and any attached files contain information intended for the exclusive use of the 
individual or entity to whom it is addressed and may contain information that is proprietary, privileged, confidential and/or 
exempt from disclosure under applicable law. If you are not the intended recipient, please notify the sender, by electronic 
mail or telephone, of any unintended recipients and delete the original message without making any copies. 



Best regards, 



Jim 




l 



Techreport 



To obtain a copy of this techreport, please 
look for it at the following site: 

http://www-cse.ucsd.edu/users/calder/papers.html 

Or send email or a letter to: 
Brad Calder 

University of California, San Diego 

9500 Gilman Drive 

La Jolla, CA 92093-0114 

calder@cs.ucsd.edu 



Intel/P16136 
US Application Serial No. 10/424,356 



EXHIBIT 16 
TO DECLARATION OF 
JAMES A FLIGHT 



James A. Flight 



From: James A. Flight
Sent: Monday, July 16, 2007 11:42 AM
To: 'calder@cs.ucsd.edu'
Subject: RE: Request for citation assistance
Signed By: jflight@hfzlaw.com
Sensitivity: Confidential



Brad, 



I was wondering if you have had an opportunity to consider this issue. I would appreciate any assistance you are able to 
provide. 

Thank you 

Jim 



From: James A. Flight 

Sent: Saturday, July 07, 2007 1:45 PM 

To: 'calder@cs.ucsd.edu' 

Subject: Request for citation assistance 

Sensitivity: Confidential 

Brad, 

We are trying to determine when the full text of your article "Phase tracking and Prediction" was available to the public. 
On its face, it appears to have been published in June of 2003. However, this link 

(http://www.cse.ucsd.edu/Dienst/UI/2.0/Describe/ncstrl.ucsd_cse/CS2002-0710) also available through here 
(http://www.cs.ucsd.edu/Dienst/UI/2.0/ListAuthors/S?authority=ncstrl.ucsd_cse) makes it appear that the article may have 
been available on request to you as early as June of 2002 (see the postscript file attached as a PDF hereto for your 
convenience). However, following the link in the Postscript file leads to this list 
(http://www-cse.ucsd.edu/users/calder/papers.html), which identifies the 2003 publication, and no 2002 publication. 

In view of the foregoing, can you help me understand what, if anything, was available in 2002? I suspect it was merely 
the abstract, but perhaps you had the draft ready and were making it available a year before its official 2003 publication? 

Thanks, in advance, for any assistance you are able to provide. 

Best regards, 

Jim 

James A. Flight 



HANLEY 
FLIGHT & 
ZIMMERMAN™ 

150 South Wacker Drive, Suite 2100 
Chicago, Illinois 60606 

(312) 580-1034 (Direct) 
(312) 580-1020 (Main) 
(312) 580-9696 (Fax) 

jflight@hfzlaw.com 







Important: This electronic mail message and any attached files contain information intended for the exclusive use of the 
individual or entity to whom it is addressed and may contain information that is proprietary, privileged, confidential and/or 
exempt from disclosure under applicable law. If you are not the intended recipient, please notify the sender, by electronic 
mail or telephone, of any unintended recipients and delete the original message without making any copies. 



2 



Intel/P16136 
US Application Serial No. 10/424,356 



EXHIBIT 17 
TO DECLARATION OF 
JAMES A FLIGHT 



James A. Flight 



From: postmaster@hfzlaw.com 

Sent: Monday, July 16, 2007 11:45 AM 

To: James A. Flight 

Subject: Delivery Status Notification (Relay) 

Attachments: ATT802303.txt; RE: Request for citation assistance 



This is an automatically generated Delivery Status Notification. 

Your message has been successfully relayed to the following recipients, but the requested 
delivery status notifications may not be generated by the destination. 

calder@cs.ucsd.edu 



1 



Intel/P16136 
US Application Serial No. 10/424,356 



EXHIBIT 18 
TO DECLARATION OF 
JAMES A FLIGHT 



James A. Flight 



From: CSE Computing Support [webmaster@cs.ucsd.edu]
Sent: Friday, July 06, 2007 7:48 PM
To: James A. Flight
Subject: [website #200731]: request for assistance.



We had a couple of problems with the techreport server. I kicked it and it seems to be 
working now. That techreport is available either by following the link you gave, or by going 
to 

http://www.cs.ucsd.edu/facresearch/technicalreports/tech reports.html 
From there you can search by Author or by Year and also get to the report. 
Let us know if you have any other problems. 



On Fri, 6 Jul 2007 14:20:22 -0700, JFlight@hfzlaw.com wrote: 
> Hello, 
> 
> The publication Sherwood et al. Phase Tracking and Prediction, is 
> dated June 2003. However, according to the Internet, at least some 
> portion of that article was available on your cite 
> (www.cse.ucsd.edu/Dienst/UI/2.0/Describe/ncstrl.ucsd_cse/CS2002-0710) 
> in June of 2002. According to the Internet Wayback machine, this page 
> was on-line as of November 19, 2002. Can you assist my research by 
> telling me what exactly was available on your cite in the 2002 time 
> frame (e.g., by giving me a copy of the postscript file)? 
> 
> Thank you, 
> 
> Jim 
> 
> James A. Flight 



> 150 South Wacker Drive, Suite 2100 

> Chicago, Illinois 60606 
> 

> (312) 580-1034 (Direct) 

> (312) 580-1020 (Main) 



-glenn 



l 



(312) 580-9696 (Fax) 
jflight@hfzlaw.com 



Important: This electronic mail message and any attached files contain 
information intended for the exclusive use of the individual or entity 
to whom it is addressed and may contain information that is 
proprietary, privileged, confidential and/or exempt from disclosure 
under applicable law. If you are not the intended recipient, please 
notify the sender, by electronic mail or telephone, of any unintended 
recipients and delete the original message without making any copies. 



2 



